* 'git gc auto' didn't trigger on large reflog
From: Markus Gerstel @ 2025-02-22 22:50 UTC
To: git

Hi everyone,

I was looking on a machine that does not normally get any attention. On this
machine a daily cronjob has been running

    git checkout -q master && git fetch && git reset --hard origin/master &&
    git gc --auto

for 6 years. The git directory now contains a .git/logs/HEAD file of 180MB
with 823921 lines.

The repo config contains

    [core]
        repositoryformatversion = 0
        filemode = true
        bare = false
        logallrefupdates = true

and the system git version is 2.36.6.

I can't change the git version -or install my own one- so I can't tell
if this has been fixed since. A manual git gc fixed everything, so I
amended the cronjob to just do that instead.

I was just slightly surprised (and amused) because I expected 'git gc
--auto' to pick this up, so I thought I'd share this with you.

Thanks
-Markus

* Re: 'git gc auto' didn't trigger on large reflog
From: Patrick Steinhardt @ 2025-02-24 10:56 UTC
To: Markus Gerstel; +Cc: git

On Sat, Feb 22, 2025 at 10:50:25PM +0000, Markus Gerstel wrote:
> Hi everyone,
>
> I was looking on a machine that does not normally get any attention. On this
> machine a daily cronjob has been running
>
> git checkout -q master && git fetch && git reset --hard origin/master &&
> git gc --auto
>
> for 6 years. The git directory now contains a .git/logs/HEAD file of 180MB
> with 823921 lines.
>
> The repo config contains
>
> [core]
> repositoryformatversion = 0
> filemode = true
> bare = false
> logallrefupdates = true
>
> and the system git version is 2.36.6.
>
> I can't change the git version -or install my own one- so I can't tell
> if this has been fixed since. A manual git gc fixed everything, so I
> amended the cronjob to just do that instead.
>
> I was just slightly surprised (and amused) because I expected 'git gc
> --auto' to pick this up, so I thought I'd share this with you.

It's a bit funny, but whether or not `git gc --auto` does anything
solely depends on the state of the object database. This is figured out
in `need_to_gc()`, which returns a truish value if either:

  - You have too many packfiles in the repository.

  - You have too many loose objects in the repository.

If these prerequisites aren't met, then git-gc(1) will skip any other
work unrelated to objects, as well, including pruning reflogs. So given
your above sequence of commands:

> git checkout -q master && git fetch && git reset --hard origin/master &&
> git gc --auto

You may hit an edge case, depending on whether or not git-fetch(1) ends
up fetching changes. While git-checkout(1) won't write any reflogs if
nothing changes, git-reset(1) writes a reflog entry regardless of
whether it performs an "actual" change. So if git-fetch(1) ends up never
fetching anything you don't accumulate new loose objects or packfiles,
but do end up writing a new reflog entry every single time. The
conditions mentioned above won't trigger, and thus the reflog is never
pruned, either.

Patrick

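The edge case Patrick describes can be illustrated with a small model. This is only a sketch under stated assumptions, not git's actual implementation: `RepoState`, `daily_cron_run` and `simulate` are hypothetical names, every fetched object is treated as a loose object, and the two limits are the documented defaults of `gc.auto` (6700) and `gc.autoPackLimit` (50). The real `need_to_gc()` estimates the loose object count rather than counting exactly.

```python
from dataclasses import dataclass

# Documented defaults of gc.auto and gc.autoPackLimit.
GC_AUTO_LOOSE_LIMIT = 6700
GC_AUTO_PACK_LIMIT = 50


@dataclass
class RepoState:
    """Hypothetical, heavily simplified stand-in for a repository."""
    loose_objects: int = 0
    packfiles: int = 1
    head_reflog_entries: int = 0


def need_to_gc(repo: RepoState) -> bool:
    """Simplified model: only the object database is consulted."""
    return (repo.loose_objects > GC_AUTO_LOOSE_LIMIT
            or repo.packfiles > GC_AUTO_PACK_LIMIT)


def daily_cron_run(repo: RepoState, fetched_objects: int = 0) -> bool:
    """Model one run of the cron job; return True if 'gc --auto' did work."""
    # Assumption: everything that is fetched ends up as loose objects.
    repo.loose_objects += fetched_objects
    # 'git reset --hard' writes a reflog entry even when nothing changes.
    repo.head_reflog_entries += 1
    if not need_to_gc(repo):
        # Nothing object-related to do, so the reflog is not looked at either.
        return False
    # A real gc repacks and expires only old reflog entries; the model
    # simply resets all counters.
    repo.loose_objects = 0
    repo.packfiles = 1
    repo.head_reflog_entries = 0
    return True


def simulate(days: int, fetched_objects: int = 0) -> RepoState:
    """Run the modelled cron job once per day and return the final state."""
    repo = RepoState()
    for _ in range(days):
        daily_cron_run(repo, fetched_objects)
    return repo
```

In this model, `simulate(365 * 6)` with nothing ever fetched never satisfies `need_to_gc()`, so the reflog counter only grows, whereas a repository that receives new objects on most days crosses the loose object limit now and then and gets its reflog cleaned up as a side effect.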
* Re: 'git gc auto' didn't trigger on large reflog
From: Junio C Hamano @ 2025-02-24 16:43 UTC
To: Patrick Steinhardt; +Cc: Markus Gerstel, git

Patrick Steinhardt <ps@pks.im> writes:

> It's a bit funny, but whether or not `git gc --auto` does anything
> solely depends on the state of the object database.

I guess after adding "auto", we haven't been careful enough to
update the triggering condition as we added new kinds of "garbage"
to collect? Should we make an exhaustive and authoritative list of
gc tasks, document them, and make sure "--auto" pays attention?

Other than objects (packing loose ones, pruning unreferenced loose
ones or packing them into cruft packs), we seem to check reflog,
worktree, and rerere database.

I do not think there is a readily usable API to query how much stale
data is in reflogs that are more than N seconds old, without which
"gc --auto" cannot make decisions. I am reasonably sure rerere API
does not give you such data, either. I have no idea about the
triggering condition of "worktree prune".

* Re: 'git gc auto' didn't trigger on large reflog
From: Patrick Steinhardt @ 2025-02-26 11:39 UTC
To: Junio C Hamano; +Cc: Markus Gerstel, git

On Mon, Feb 24, 2025 at 08:43:23AM -0800, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
>
> > It's a bit funny, but whether or not `git gc --auto` does anything
> > solely depends on the state of the object database.
>
> I guess after adding "auto", we haven't been careful enough to
> update the triggering condition as we added new kinds of "garbage"
> to collect? Should we make an exhaustive and authoritative list of
> gc tasks, document them, and make sure "--auto" pays attention?

Maybe. But maybe a better solution would be to build this into
git-maintenance(1) instead, which is a lot more fine-grained. It already
has properly defined subtasks, and each of these subtasks has an
optional callback function that makes it only run as-needed.

So from my perspective we should:

  - Expand git-maintenance(1) to gain a new task for expiring reflogs.

  - Adapt it to not use git-gc(1) anymore, but instead use the specific
    subtasks.

It also allows us to iterate a lot more on the actual tasks run by the
command and make them configurable. It would for example allow us to
eventually enable incremental repacking via multi-pack indices or
geometric repacking.

> Other than objects (packing loose ones, pruning unreferenced loose
> ones or packing them into cruft packs), we seem to check reflog,
> worktree, and rerere database.
>
> I do not think there is a readily usable API to query how much stale
> data is in reflogs that are more than N seconds old, without which
> "gc --auto" cannot make decisions. I am reasonably sure rerere API
> does not give you such data, either. I have no idea about the
> triggering condition of "worktree prune".

No, there isn't, and computing it is also potentially expensive. You
basically have to iterate through each reflog and then also iterate
through all of its reflog entries to figure out whether anything needs
cleaning or not.

But probably we can come up with clever heuristics instead that don't
require us to be this thorough. We could for example just read the
"HEAD" reflog and figure out whether it contains reflog entries that
would be pruned.

Patrick

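The "HEAD" reflog check suggested above could look roughly like the following. It is a sketch outside of git itself, assuming the "files" ref backend, where each line of .git/logs/HEAD has the form `<old-oid> <new-oid> <ident> <timestamp> <tz>` followed by a tab and the message. The names `entry_timestamp` and `count_prunable` are made up for illustration, and only the age rule (90 days, the default of `gc.reflogExpire`) is modelled; the separate, shorter expiry for unreachable entries is ignored.

```python
import time
from typing import Iterable, Optional

# Default of gc.reflogExpire is 90 days.
DEFAULT_EXPIRE_SECONDS = 90 * 24 * 60 * 60


def entry_timestamp(line: str) -> Optional[int]:
    """Extract the committer timestamp from one reflog line, if parseable."""
    # Everything before the first tab is "<old> <new> <ident> <time> <tz>".
    header = line.rstrip("\n").split("\t", 1)[0]
    parts = header.rsplit(" ", 2)
    if len(parts) != 3 or not parts[1].isdigit():
        return None
    return int(parts[1])


def count_prunable(lines: Iterable[str],
                   now: Optional[float] = None,
                   max_age: int = DEFAULT_EXPIRE_SECONDS) -> int:
    """Count entries older than max_age; unparseable lines are skipped."""
    cutoff = (time.time() if now is None else now) - max_age
    count = 0
    for line in lines:
        timestamp = entry_timestamp(line)
        if timestamp is not None and timestamp < cutoff:
            count += 1
    return count
```

Run against an open .git/logs/HEAD file this still reads every line, so it is cheaper than walking all reflogs but not free; for a pure yes/no answer one could stop at the first entry older than the cutoff.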
* Re: 'git gc auto' didn't trigger on large reflog
From: Junio C Hamano @ 2025-02-26 16:10 UTC
To: Patrick Steinhardt; +Cc: Markus Gerstel, git

Patrick Steinhardt <ps@pks.im> writes:

> No, there isn't, and computing it is also potentially expensive. You
> basically have to iterate through each reflog and then also iterate
> through all of its reflog entries to figure out whether anything needs
> cleaning or not.
>
> But probably we can come up with clever heuristics instead that don't
> require us to be this thorough. We could for example just read the
> "HEAD" reflog and figure out whether it contains reflog entries that
> would be pruned.

As we should be able to "seek" to implement HEAD@{2.months.ago}, I'd
imagine that we should be able to ask "give me the oldest entry in
your log" to a ref. Ask that question to a handful of refs that have
been most recently modified (with the theory that a ref that is more
often modified is also likely to have been touched in the recent
past---your HEAD heuristics is a good approximation), and we learn
fairly cheaply if it is likely that we have entries to be expired.

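A rough sketch of this sampling idea, written against the on-disk layout rather than any git API: it assumes the "files" ref backend, where reflogs live under .git/logs and are append-only, so the first line of each file is its oldest entry. The names `oldest_entry_timestamp` and `reflogs_likely_need_expiry`, the sample size of 5, and the use of file modification time as a proxy for "most recently modified ref" are all assumptions made for illustration.

```python
import os
import time
from pathlib import Path
from typing import Optional

# Default of gc.reflogExpire is 90 days.
DEFAULT_EXPIRE_SECONDS = 90 * 24 * 60 * 60


def oldest_entry_timestamp(reflog: Path) -> Optional[int]:
    """Timestamp of the first line, i.e. the oldest entry of the reflog."""
    with reflog.open("r", encoding="utf-8", errors="replace") as handle:
        first = handle.readline()
    # Everything before the first tab is "<old> <new> <ident> <time> <tz>".
    header = first.rstrip("\n").split("\t", 1)[0]
    parts = header.rsplit(" ", 2)
    if len(parts) != 3 or not parts[1].isdigit():
        return None
    return int(parts[1])


def reflogs_likely_need_expiry(logs_dir: Path,
                               sample_size: int = 5,
                               max_age: int = DEFAULT_EXPIRE_SECONDS,
                               now: Optional[float] = None) -> bool:
    """Check only the most recently modified reflogs for an expired entry."""
    cutoff = (time.time() if now is None else now) - max_age
    reflogs = [Path(root) / name
               for root, _dirs, files in os.walk(logs_dir)
               for name in files]
    # Most recently written reflog files first.
    reflogs.sort(key=lambda path: path.stat().st_mtime, reverse=True)
    for reflog in reflogs[:sample_size]:
        timestamp = oldest_entry_timestamp(reflog)
        if timestamp is not None and timestamp < cutoff:
            return True
    return False
```

Calling `reflogs_likely_need_expiry(Path(".git/logs"))` reads one line from at most five files, although the directory walk still has to stat every reflog file. A ref backend that can report its oldest entry directly would avoid that walk.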