git.vger.kernel.org archive mirror
* 'git gc auto' didn't trigger on large reflog
@ 2025-02-22 22:50 Markus Gerstel
  2025-02-24 10:56 ` Patrick Steinhardt
From: Markus Gerstel @ 2025-02-22 22:50 UTC
  To: git

Hi everyone,

I was looking at a machine that does not normally get any attention. On 
this machine a daily cronjob has been running

     git checkout -q master && git fetch &&
         git reset --hard origin/master && git gc --auto

for 6 years. The git directory now contains a .git/logs/HEAD file of 
180MB with 823921 lines.

The repo config contains

[core]
         repositoryformatversion = 0
         filemode = true
         bare = false
         logallrefupdates = true

and the system git version is 2.36.6.

I can't change the git version -or install my own one- so I can't tell 
if this has been fixed since.
A manual git gc fixed everything, so I amended the cronjob to just do 
that instead.

I was just slightly surprised (and amused) because I expected 'git gc 
--auto' to pick this up, so I thought I'd share this with you.

Thanks

-Markus



* Re: 'git gc auto' didn't trigger on large reflog
  2025-02-22 22:50 'git gc auto' didn't trigger on large reflog Markus Gerstel
@ 2025-02-24 10:56 ` Patrick Steinhardt
  2025-02-24 16:43   ` Junio C Hamano
From: Patrick Steinhardt @ 2025-02-24 10:56 UTC
  To: Markus Gerstel; +Cc: git

On Sat, Feb 22, 2025 at 10:50:25PM +0000, Markus Gerstel wrote:
> Hi everyone,
> 
> I was looking at a machine that does not normally get any attention. On this
> machine a daily cronjob has been running
> 
>     git checkout -q master && git fetch &&
>         git reset --hard origin/master && git gc --auto
> 
> for 6 years. The git directory now contains a .git/logs/HEAD file of 180MB
> with 823921 lines.
> 
> The repo config contains
> 
> [core]
>         repositoryformatversion = 0
>         filemode = true
>         bare = false
>         logallrefupdates = true
> 
> and the system git version is 2.36.6.
> 
> I can't change the git version -or install my own one- so I can't tell
> if this has been fixed since. A manual git gc fixed everything, so I
> amended the cronjob to just do that instead.
> 
> I was just slightly surprised (and amused) because I expected 'git gc
> --auto' to pick this up, so I thought I'd share this with you.

It's a bit funny, but whether or not `git gc --auto` does anything
solely depends on the state of the object database. This is figured out
in `need_to_gc()`, which returns a truish value if either:

  - You have too many packfiles in the repository.

  - You have too many loose objects in the repository.

If these prerequisites aren't met, then git-gc(1) will skip any other
work unrelated to objects, as well, including pruning reflogs.

So given your above sequence of commands:

>     git checkout -q master && git fetch &&
>         git reset --hard origin/master && git gc --auto

You may hit an edge case, depending on whether or not git-fetch(1) ends
up fetching changes. While git-checkout(1) won't write any reflogs if
nothing changes, git-reset(1) writes a reflog entry regardless of
whether it performs an "actual" change. So if git-fetch(1) ends up never
fetching anything you don't accumulate new loose objects or packfiles,
but do end up writing a new reflog entry every single time. The
conditions mentioned above won't trigger, and thus the reflog is never
pruned, either.
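
As a rough illustration of why the reflog never enters the picture, the
two conditions can be approximated in shell. This is a simplified
sketch, not the actual `need_to_gc()` logic: the helper name
`needs_auto_gc` is made up for this example, the thresholds are the
documented defaults of gc.auto (6700) and gc.autoPackLimit (50), and
real Git estimates the loose object count rather than counting exactly
and ignores packs marked with a .keep file.

```shell
# Hypothetical helper (not part of Git): succeeds when the simplified
# "gc --auto" conditions hold for the given object counts.
#   $1 = number of loose objects
#   $2 = number of packfiles
#   $3 = loose object threshold (default 6700, as for gc.auto)
#   $4 = packfile threshold     (default 50, as for gc.autoPackLimit)
needs_auto_gc() (
    loose=$1
    packs=$2
    loose_limit=${3:-6700}
    pack_limit=${4:-50}

    # A threshold of 0 (or less) for gc.auto disables automatic gc.
    [ "$loose_limit" -gt 0 ] || exit 1

    # Condition 1: too many loose objects.
    if [ "$loose" -gt "$loose_limit" ]; then
        exit 0
    fi

    # Condition 2: too many packfiles (a limit of 0 disables this check).
    if [ "$pack_limit" -gt 0 ] && [ "$packs" -gt "$pack_limit" ]; then
        exit 0
    fi

    exit 1
)

# Example wiring (not executed here), using `git count-objects -v`:
#   loose=$(git count-objects -v | awk '/^count:/ {print $2}')
#   packs=$(git count-objects -v | awk '/^packs:/ {print $2}')
#   needs_auto_gc "$loose" "$packs" && echo "gc --auto would do work"
```

Nothing in the inputs describes the reflog, so a repository that gains
no new objects still reports "no work" however many reflog entries it
has accumulated.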

Patrick


* Re: 'git gc auto' didn't trigger on large reflog
  2025-02-24 10:56 ` Patrick Steinhardt
@ 2025-02-24 16:43   ` Junio C Hamano
  2025-02-26 11:39     ` Patrick Steinhardt
From: Junio C Hamano @ 2025-02-24 16:43 UTC
  To: Patrick Steinhardt; +Cc: Markus Gerstel, git

Patrick Steinhardt <ps@pks.im> writes:

> It's a bit funny, but whether or not `git gc --auto` does anything
> solely depends on the state of the object database.

I guess after adding "auto", we haven't been careful enough to
update the triggering condition as we added new kinds of "garbage"
to collect?  Should we make an exhaustive and authoritative list of
gc tasks, document them, and make sure "--auto" pays attention?

Other than objects (packing loose ones, pruning unreferenced loose
ones or packing them into cruft packs), we seem to check reflog,
worktree, and rerere database.

I do not think there is a readily usable API to query how much stale
data is in reflogs that are more than N seconds old, without which
"gc --auto" cannot make decisions.  I am reasonably sure rerere API
does not give you such data, either.  I have no idea about the
triggering condition of "worktree prune".




* Re: 'git gc auto' didn't trigger on large reflog
  2025-02-24 16:43   ` Junio C Hamano
@ 2025-02-26 11:39     ` Patrick Steinhardt
  2025-02-26 16:10       ` Junio C Hamano
From: Patrick Steinhardt @ 2025-02-26 11:39 UTC
  To: Junio C Hamano; +Cc: Markus Gerstel, git

On Mon, Feb 24, 2025 at 08:43:23AM -0800, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> 
> > It's a bit funny, but whether or not `git gc --auto` does anything
> > solely depends on the state of the object database.
> 
> I guess after adding "auto", we haven't been careful enough to
> update the triggering condition as we added new kinds of "garbage"
> to collect?  Should we make an exhaustive and authoritative list of
> gc tasks, document them, and make sure "--auto" pays attention?

Maybe. But maybe a better solution would be to build this into
git-maintenance(1) instead, which is a lot more fine-grained. It already
has properly defined subtasks, and each of these subtasks has an
optional callback function that makes it only run as-needed.

So from my perspective we should:

  - Expand git-maintenance(1) to gain a new task for expiring reflogs.

  - Adapt it to not use git-gc(1) anymore, but instead use the specific
    subtasks.

It also allows us to iterate a lot more on the actual tasks run by the
command and make them configurable. It would for example allow us to
eventually enable incremental repacking via multi-pack indices or
geometric repacking.

> Other than objects (packing loose ones, pruning unreferenced loose
> ones or packing them into cruft packs), we seem to check reflog,
> worktree, and rerere database.
> 
> I do not think there is a readily usable API to query how much stale
> data is in reflogs that are more than N seconds old, without which
> "gc --auto" cannot make decisions.  I am reasonably sure rerere API
> does not give you such data, either.  I have no idea about the
> triggering condition of "worktree prune".

No, there isn't, and computing it is also potentially expensive. You
basically have to iterate through each reflog and then also iterate
through all of its reflog entries to figure out whether anything needs
cleaning or not.

But probably we can come up with clever heuristics instead that don't
require us to be this thorough. We could for example just read the
"HEAD" reflog and figure out whether it contains reflog entries that
would be pruned.
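
A minimal sketch of such a HEAD-only check, assuming the "files" ref
backend, where .git/logs/HEAD holds one entry per line and the
timestamp is the second-to-last space-separated field before the tab.
The function name and the plain age cutoff are assumptions for
illustration. The 90 days match the documented default of
gc.reflogExpire, but actual expiry also treats unreachable entries
differently (gc.reflogExpireUnreachable), which this ignores.

```shell
# Hypothetical helper (not part of Git): print how many entries in a
# files-backend reflog are older than the cutoff.
#   $1 = path to the reflog file, e.g. .git/logs/HEAD
#   $2 = maximum age in days (default 90)
#   $3 = "now" as seconds since the epoch (default: current time)
count_expirable_entries() (
    reflog_file=$1
    max_age_days=${2:-90}
    now=${3:-$(date +%s)}
    cutoff=$((now - max_age_days * 86400))

    # Everything before the tab is "<old> <new> <name> <email> <time> <tz>".
    # The name may contain spaces, so pick the timestamp from the end.
    awk -F'\t' -v cutoff="$cutoff" '
        {
            n = split($1, f, " ")
            if (n >= 2 && f[n - 1] + 0 < cutoff + 0)
                stale++
        }
        END { print stale + 0 }
    ' "$reflog_file"
)

# Example (not executed here):
#   count_expirable_entries .git/logs/HEAD
```

This still reads every line of one file, but it avoids walking every
reflog in the repository.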

Patrick


* Re: 'git gc auto' didn't trigger on large reflog
  2025-02-26 11:39     ` Patrick Steinhardt
@ 2025-02-26 16:10       ` Junio C Hamano
From: Junio C Hamano @ 2025-02-26 16:10 UTC
  To: Patrick Steinhardt; +Cc: Markus Gerstel, git

Patrick Steinhardt <ps@pks.im> writes:

> No, there isn't, and computing it is also potentially expensive. You
> basically have to iterate through each reflog and then also iterate
> through all of its reflog entries to figure out whether anything needs
> cleaning or not.
>
> But probably we can come up with clever heuristics instead that don't
> require us to be this thorough. We could for example just read the
> "HEAD" reflog and figure out whether it contains reflog entries that
> would be pruned.

As we should be able to "seek" to implement HEAD@{2.months.ago}, I'd
imagine that we should be able to ask "give me the oldest entry in
your log" to a ref.  Ask that question to a handful of refs that
have been most recently modified (with the theory that a ref that is
more often modified is also likely to have been touched in the
recent past; your HEAD heuristic is a good approximation), and we
learn fairly cheaply if it is likely that we have entries to be
expired.
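
Something along these lines, sketched in shell against the "files"
backend, where the oldest entry of a reflog is simply its first line.
The helper names, the default sample size of 5 and the use of file
modification times as a stand-in for "most recently modified refs" are
all assumptions made for this sketch, not an existing Git API.

```shell
# Hypothetical helper: print the timestamp of the oldest (first) entry
# in a files-backend reflog, or nothing if the file is empty.
oldest_entry_timestamp() {
    awk -F'\t' 'NR == 1 {
        n = split($1, f, " ")
        if (n >= 2) print f[n - 1]
        exit
    }' "$1"
}

# Hypothetical helper: succeed if any of the most recently modified
# reflogs starts with an entry older than the cutoff.
#   $1 = reflog directory, e.g. .git/logs
#   $2 = how many reflogs to sample (default 5)
#   $3 = maximum age in days (default 90)
#   $4 = "now" as seconds since the epoch (default: current time)
likely_has_expirable() (
    logs_dir=$1
    sample=${2:-5}
    max_age_days=${3:-90}
    now=${4:-$(date +%s)}
    cutoff=$((now - max_age_days * 86400))

    # `ls -t` lists the newest files first; keep only the first few.
    find "$logs_dir" -type f -exec ls -t {} + | head -n "$sample" | {
        found=1
        while IFS= read -r reflog; do
            oldest=$(oldest_entry_timestamp "$reflog")
            case $oldest in
                '' | *[!0-9]*) continue ;;   # empty or unparsable, skip
            esac
            if [ "$oldest" -lt "$cutoff" ]; then
                found=0
            fi
        done
        exit "$found"
    }
)

# Example (not executed here):
#   likely_has_expirable .git/logs && echo "reflog expiry looks worthwhile"
```

Only the first line of a handful of files is read, so the cost stays
small even when a single reflog has grown to hundreds of thousands of
entries.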

