git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Consider adding pruning of refs to git maintenance
@ 2024-12-16 12:33 Shubham Kanodia
  2024-12-17  7:21 ` Patrick Steinhardt
  0 siblings, 1 reply; 11+ messages in thread
From: Shubham Kanodia @ 2024-12-16 12:33 UTC (permalink / raw)
  To: git, Patrick Steinhardt

Remote-tracking refs accumulate quickly in large repositories as users
merge and delete their branches. While these branches are cleaned up
on the remote, local repositories may retain stale references to
deleted branches unless explicitly pruned. The number of local refs
can have an impact on git performance of several commands.

Git currently provides two ways for orphan local refs to be cleaned up —
1. Automated: `fetch.prune` and `fetch.pruneTags` configurations with
`git fetch/pull`
2. Manual: `git remote prune`

However, both approaches have issues:
- Full `git fetch/pull` operations are expensive on large
repositories, pulling thousands of irrelevant refs
- Manual `git remote prune` requires user intervention

Proposal:
Add remote pruning to the daily `git-maintenance` task. This would
clean stale refs automatically without requiring full fetches or
manual intervention.

This is especially useful for users who historically pulled all
refs/tags but now use targeted fetches. Moreover, it decouples the
cleanup action (pruning) from the action to fetch more refs.

Happy to submit on a patch for the same unless there's something
obvious that I've missed here.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Consider adding pruning of refs to git maintenance
  2024-12-16 12:33 Consider adding pruning of refs to git maintenance Shubham Kanodia
@ 2024-12-17  7:21 ` Patrick Steinhardt
  2024-12-17  7:41   ` Junio C Hamano
  0 siblings, 1 reply; 11+ messages in thread
From: Patrick Steinhardt @ 2024-12-17  7:21 UTC (permalink / raw)
  To: Shubham Kanodia; +Cc: git, Derrick Stolee

On Mon, Dec 16, 2024 at 06:03:03PM +0530, Shubham Kanodia wrote:
> Remote-tracking refs accumulate quickly in large repositories as users
> merge and delete their branches. While these branches are cleaned up
> on the remote, local repositories may retain stale references to
> deleted branches unless explicitly pruned. The number of local refs
> can have an impact on git performance of several commands.
> 
> Git currently provides two ways for orphan local refs to be cleaned up —
> 1. Automated: `fetch.prune` and `fetch.pruneTags` configurations with
> `git fetch/pull`
> 2. Manual: `git remote prune`
> 
> However, both approaches have issues:
> - Full `git fetch/pull` operations are expensive on large
> repositories, pulling thousands of irrelevant refs
> - Manual `git remote prune` requires user intervention

Fair. Neither of those issues feel insurmountable, but I can see why it
could make our users lifes easier.

> Proposal:
> Add remote pruning to the daily `git-maintenance` task. This would
> clean stale refs automatically without requiring full fetches or
> manual intervention.
> 
> This is especially useful for users who historically pulled all
> refs/tags but now use targeted fetches. Moreover, it decouples the
> cleanup action (pruning) from the action to fetch more refs.

I think we need to consider a couple of things:

  - It's somewhat awkward to have maintenance jobs that interact with a
    remote, as that may not work in contexts where you actually need to
    authenticate. But there is precedent with the "prefetch" task, so we
    have already opened that can of worms.

  - Maintenance tries to be as non-destructive as reasonably possible,
    and deleting refs certainly is a destructive operation.

  - We try to avoid bad interactions with a user that works concurrently
    in the repo that git-maintenance(1) runs in. This is the reason why
    the "prefetch" task does not fetch into `refs/remotes`, but into a
    separate ref namespace.

If we want to have such a feature I'd thus claim that it would be most
sensible to make it opt-in rather than opt-out. I wouldn't want to be
surprised by remote refs vanishing after going to bed, but may be okay
with it when I explicitly ask for it.

At that point one has to raise the question whether it is still all that
useful compared to running `git remote prune` manually every now and
then. Mostly because explicitly configuring maintenance is probably
something that only power users would do, and those power users would
likely know to prune manually.

In any case, that's just my 2c. I can see a usecase for your feature,
but think we should be careful with how it is introduced.

> Happy to submit on a patch for the same unless there's something
> obvious that I've missed here.

I'm happy to have a look in case you decide to implement this feature.

Patrick

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Consider adding pruning of refs to git maintenance
  2024-12-17  7:21 ` Patrick Steinhardt
@ 2024-12-17  7:41   ` Junio C Hamano
  2024-12-17 11:21     ` Shubham Kanodia
  0 siblings, 1 reply; 11+ messages in thread
From: Junio C Hamano @ 2024-12-17  7:41 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: Shubham Kanodia, git, Derrick Stolee

Patrick Steinhardt <ps@pks.im> writes:

> If we want to have such a feature I'd thus claim that it would be most
> sensible to make it opt-in rather than opt-out. I wouldn't want to be
> surprised by remote refs vanishing after going to bed, but may be okay
> with it when I explicitly ask for it.
>
> At that point one has to raise the question whether it is still all that
> useful compared to running `git remote prune` manually every now and
> then. Mostly because explicitly configuring maintenance is probably
> something that only power users would do, and those power users would
> likely know to prune manually.

I am 100% in agreement with your reasoning.  If this thing is to
exist, it has to be opt-in.  We also need to add ample warning why
doing this asynchronously behind user's back while the user could be
working in the same repository is prone to cause surprises in the
documentation in big red letters.

I however am OK with the idea of having it as an optional "task"
that is disabled by default in "git maintenance".  "git maintenance"
can be viewed as a platform-neutral way to set up scheduled tasks.

A user may be a Git expert who knows what `git remote prune` does,
and understands and accepts the risk of confusion doing so without
explicit "do it now" from the end-user, but may not be a Linux or a
macOS or a Windows expert to know how to write crontab or its
equivalents on various schemes to define scheduled tasks.

Thanks.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Consider adding pruning of refs to git maintenance
  2024-12-17  7:41   ` Junio C Hamano
@ 2024-12-17 11:21     ` Shubham Kanodia
  2024-12-17 11:24       ` Shubham Kanodia
  0 siblings, 1 reply; 11+ messages in thread
From: Shubham Kanodia @ 2024-12-17 11:21 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Patrick Steinhardt, git, Derrick Stolee

On Tue, Dec 17, 2024 at 1:11 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Patrick Steinhardt <ps@pks.im> writes:
>
> > If we want to have such a feature I'd thus claim that it would be most
> > sensible to make it opt-in rather than opt-out. I wouldn't want to be
> > surprised by remote refs vanishing after going to bed, but may be okay
> > with it when I explicitly ask for it.
> >
> > At that point one has to raise the question whether it is still all that
> > useful compared to running `git remote prune` manually every now and
> > then. Mostly because explicitly configuring maintenance is probably
> > something that only power users would do, and those power users would
> > likely know to prune manually.
>
> I am 100% in agreement with your reasoning.  If this thing is to
> exist, it has to be opt-in.  We also need to add ample warning why
> doing this asynchronously behind user's back while the user could be
> working in the same repository is prone to cause surprises in the
> documentation in big red letters.
>
> I however am OK with the idea of having it as an optional "task"
> that is disabled by default in "git maintenance".  "git maintenance"
> can be viewed as a platform-neutral way to set up scheduled tasks.
>
> A user may be a Git expert who knows what `git remote prune` does,
> and understands and accepts the risk of confusion doing so without
> explicit "do it now" from the end-user, but may not be a Linux or a
> macOS or a Windows expert to know how to write crontab or its
> equivalents on various schemes to define scheduled tasks.
>
> Thanks.

Thanks for sharing your thoughts here. For some context — I currently
look after git performance for a very large repository (60k+ refs)
with a lot of active developers.
I've observed that while power git users keep their repository tidy
even without runnining maintenance, the majority of users infact don't
know (or execute these), and
run into performance issues. Easier fleet management was perhaps was
one of the motivations behind adding this to `git-maintenance`.

While its pretty rare that someone relies on a `refs/remote/*`
reference (without creating a local copy of it in `refs/heads`), I can
see why it can be surprising and an opt-in
should be okay for me.

Thanks,
Shubham K

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Consider adding pruning of refs to git maintenance
  2024-12-17 11:21     ` Shubham Kanodia
@ 2024-12-17 11:24       ` Shubham Kanodia
  2024-12-17 19:56         ` Junio C Hamano
  0 siblings, 1 reply; 11+ messages in thread
From: Shubham Kanodia @ 2024-12-17 11:24 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Patrick Steinhardt, git, Derrick Stolee

On Tue, Dec 17, 2024 at 4:51 PM Shubham Kanodia
<shubham.kanodia10@gmail.com> wrote:
>
> On Tue, Dec 17, 2024 at 1:11 PM Junio C Hamano <gitster@pobox.com> wrote:
> >
> > Patrick Steinhardt <ps@pks.im> writes:
> >
> > > If we want to have such a feature I'd thus claim that it would be most
> > > sensible to make it opt-in rather than opt-out. I wouldn't want to be
> > > surprised by remote refs vanishing after going to bed, but may be okay
> > > with it when I explicitly ask for it.
> > >
> > > At that point one has to raise the question whether it is still all that
> > > useful compared to running `git remote prune` manually every now and
> > > then. Mostly because explicitly configuring maintenance is probably
> > > something that only power users would do, and those power users would
> > > likely know to prune manually.
> >
> > I am 100% in agreement with your reasoning.  If this thing is to
> > exist, it has to be opt-in.  We also need to add ample warning why
> > doing this asynchronously behind user's back while the user could be
> > working in the same repository is prone to cause surprises in the
> > documentation in big red letters.
> >
> > I however am OK with the idea of having it as an optional "task"
> > that is disabled by default in "git maintenance".  "git maintenance"
> > can be viewed as a platform-neutral way to set up scheduled tasks.
> >
> > A user may be a Git expert who knows what `git remote prune` does,
> > and understands and accepts the risk of confusion doing so without
> > explicit "do it now" from the end-user, but may not be a Linux or a
> > macOS or a Windows expert to know how to write crontab or its
> > equivalents on various schemes to define scheduled tasks.
> >
> > Thanks.
>
> Thanks for sharing your thoughts here. For some context — I currently
> look after git performance for a very large repository (60k+ refs)
> with a lot of active developers.
> I've observed that while power git users keep their repository tidy
> even without runnining maintenance, the majority of users infact don't
> know (or execute these), and
> run into performance issues. Easier fleet management was perhaps was
> one of the motivations behind adding this to `git-maintenance`.
>
> While its pretty rare that someone relies on a `refs/remote/*`
> reference (without creating a local copy of it in `refs/heads`), I can
> see why it can be surprising and an opt-in
> should be okay for me.
>
> Thanks,
> Shubham K

From aed935055aff819cce591eecbf6697f8e8e7b99e Mon Sep 17 00:00:00 2001
From: Shubham Kanodia <shubham.kanodia10@gmail.com>
Date: Tue, 17 Dec 2024 16:05:16 +0530
Subject: [PATCH] maintenance: add prune-remote-refs task

Remote-tracking refs can accumulate in local repositories even as branches
are deleted on remotes, impacting git performance. While git fetch --prune
helps clean these up, it requires a full fetch operation which is expensive
for large repositories.

Add a new maintenance task 'prune-remote-refs' that runs 'git remote prune'
for each configured remote. This provides an automated way to clean up stale
remote-tracking refs without requiring full fetches or manual intervention.
This task is disabled by default.

Signed-off-by: Shubham Kanodia <shubham.kanodia10@gmail.com>
---
 Documentation/git-maintenance.txt | 14 ++++++++++
 builtin/gc.c                      | 42 +++++++++++++++++++++++++++++
 t/t7900-maintenance.sh            | 44 +++++++++++++++++++++++++++++++
 3 files changed, 100 insertions(+)

diff --git a/Documentation/git-maintenance.txt
b/Documentation/git-maintenance.txt
index 6e6651309d..3e0ca84de2 100644
--- a/Documentation/git-maintenance.txt
+++ b/Documentation/git-maintenance.txt
@@ -158,6 +158,20 @@ pack-refs::
  need to iterate across many references. See linkgit:git-pack-refs[1]
  for more information.

+prune-remote-refs::
+ The `prune-remote-refs` task runs `git remote prune` on each remote
+ repository registered in the local repository. This task helps clean
+ up deleted remote branches, improving the performance of operations
+ that iterate through the refs. See linkgit:git-remote[1] for more
+ information. This task is disabled by default and must be enabled
+ manually.
++
+NOTE: This task is opt-in to prevent unexpected removal of remote refs
+for users of git-maintenance. It is recommended for power users who
+are comfortable with configuring maintenance tasks and understand the
+implications of automatic pruning. For others, running `git remote prune`
+manually might be more appropriate.
+
 OPTIONS
 -------
 --auto::
diff --git a/builtin/gc.c b/builtin/gc.c
index 4ae5196aed..9acf1d2989 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -20,6 +20,7 @@
 #include "lockfile.h"
 #include "parse-options.h"
 #include "run-command.h"
+#include "remote.h"
 #include "sigchain.h"
 #include "strvec.h"
 #include "commit.h"
@@ -913,6 +914,40 @@ static int maintenance_opt_schedule(const struct
option *opt, const char *arg,
  return 0;
 }

+static int collect_remote(struct remote *remote, void *cb_data)
+{
+ struct string_list *list = cb_data;
+
+ if (!remote->url.nr)
+ return 0;
+
+ string_list_append(list, remote->name);
+ return 0;
+}
+
+static int maintenance_task_prune_remote(struct maintenance_run_opts
*opts UNUSED,
+ struct gc_config *cfg UNUSED)
+{
+ struct string_list_item *item;
+ struct string_list remotes_list = STRING_LIST_INIT_NODUP;
+ struct child_process child = CHILD_PROCESS_INIT;
+ int result = 0;
+
+ for_each_remote(collect_remote, &remotes_list);
+
+ for_each_string_list_item (item, &remotes_list) {
+ const char *remote_name = item->string;
+ child.git_cmd = 1;
+ strvec_pushl(&child.args, "remote", "prune", remote_name, NULL);
+
+ if (run_command(&child))
+ result = error(_("failed to prune '%s'"), remote_name);
+ }
+
+ string_list_clear(&remotes_list, 0);
+ return result;
+}
+
 /* Remember to update object flag allocation in object.h */
 #define SEEN (1u<<0)

@@ -1375,6 +1410,7 @@ enum maintenance_task_label {
  TASK_GC,
  TASK_COMMIT_GRAPH,
  TASK_PACK_REFS,
+ TASK_PRUNE_REMOTE_REFS,

  /* Leave as final value */
  TASK__COUNT
@@ -1411,6 +1447,10 @@ static struct maintenance_task tasks[] = {
  maintenance_task_pack_refs,
  pack_refs_condition,
  },
+ [TASK_PRUNE_REMOTE_REFS] = {
+ "prune-remote-refs",
+ maintenance_task_prune_remote,
+ },
 };

 static int compare_tasks_by_selection(const void *a_, const void *b_)
@@ -1505,6 +1545,8 @@ static void initialize_maintenance_strategy(void)
  tasks[TASK_LOOSE_OBJECTS].schedule = SCHEDULE_DAILY;
  tasks[TASK_PACK_REFS].enabled = 1;
  tasks[TASK_PACK_REFS].schedule = SCHEDULE_WEEKLY;
+ tasks[TASK_PRUNE_REMOTE_REFS].enabled = 0;
+ tasks[TASK_PRUNE_REMOTE_REFS].schedule = SCHEDULE_DAILY;
  }
 }

diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
index 0ce4ba1cbe..60a0c3f835 100755
--- a/t/t7900-maintenance.sh
+++ b/t/t7900-maintenance.sh
@@ -446,6 +446,50 @@ test_expect_success 'pack-refs task' '
  test_subcommand git pack-refs --all --prune <pack-refs.txt
 '

+test_expect_success 'prune-remote-refs task not enabled by default' '
+ git clone . prune-test &&
+ (
+ cd prune-test &&
+ GIT_TRACE2_EVENT="$(pwd)/prune.txt" git maintenance run 2>err &&
+ test_subcommand ! git remote prune origin <prune.txt
+ )
+'
+
+test_expect_success 'prune-remote-refs task cleans stale remote refs' '
+ test_commit initial &&
+
+ # Create two separate remote repos
+ git clone . remote1 &&
+ git clone . remote2 &&
+
+ git clone . prune-test-clean &&
+ (
+ cd prune-test-clean &&
+ git config maintenance.prune-remote-refs.enabled true &&
+
+ # Add both remotes
+ git remote add remote1 "../remote1" &&
+ git remote add remote2 "../remote2" &&
+
+ # Create and push branches to both remotes
+ git branch -f side2 HEAD &&
+ git push remote1 side2 &&
+ git push remote2 side2 &&
+
+ # Rename branches in each remote to simulate a stale branch
+ git -C ../remote1 branch -m side2 side3 &&
+ git -C ../remote2 branch -m side2 side4 &&
+
+ GIT_TRACE2_EVENT="$(pwd)/prune.txt" git maintenance run
--task=prune-remote-refs &&
+
+ # Verify pruning happened for both remotes
+ test_subcommand git remote prune remote1 <prune.txt &&
+ test_subcommand git remote prune remote2 <prune.txt &&
+ test_must_fail git rev-parse refs/remotes/remote1/side2 &&
+ test_must_fail git rev-parse refs/remotes/remote2/side2
+ )
+'
+
 test_expect_success '--auto and --schedule incompatible' '
  test_must_fail git maintenance run --auto --schedule=daily 2>err &&
  test_grep "at most one" err
-- 
2.46.2

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: Consider adding pruning of refs to git maintenance
  2024-12-17 11:24       ` Shubham Kanodia
@ 2024-12-17 19:56         ` Junio C Hamano
  2024-12-18  8:30           ` Shubham Kanodia
  0 siblings, 1 reply; 11+ messages in thread
From: Junio C Hamano @ 2024-12-17 19:56 UTC (permalink / raw)
  To: Shubham Kanodia; +Cc: Patrick Steinhardt, git, Derrick Stolee

Shubham Kanodia <shubham.kanodia10@gmail.com> writes:

> On Tue, Dec 17, 2024 at 4:51 PM Shubham Kanodia
> <shubham.kanodia10@gmail.com> wrote:
>>
>> On Tue, Dec 17, 2024 at 1:11 PM Junio C Hamano <gitster@pobox.com> wrote:
>> >
>> > Patrick Steinhardt <ps@pks.im> writes:
>> >
>> > > If we want to have such a feature I'd thus claim that it would be most
>> > > sensible to make it opt-in rather than opt-out. I wouldn't want to be
>> > > surprised by remote refs vanishing after going to bed, but may be okay
>> > > with it when I explicitly ask for it.
>> > >
>> > > At that point one has to raise the question whether it is still all that
>> > > useful compared to running `git remote prune` manually every now and
>> > > then. Mostly because explicitly configuring maintenance is probably
>> > > something that only power users would do, and those power users would
>> > > likely know to prune manually.
>> >
>> > I am 100% in agreement with your reasoning.  If this thing is to
>> > exist, it has to be opt-in.  We also need to add ample warning why
>> > doing this asynchronously behind user's back while the user could be
>> > working in the same repository is prone to cause surprises in the
>> > documentation in big red letters.
>> >
>> > I however am OK with the idea of having it as an optional "task"
>> > that is disabled by default in "git maintenance".  "git maintenance"
>> > can be viewed as a platform-neutral way to set up scheduled tasks.
>> >
>> > A user may be a Git expert who knows what `git remote prune` does,
>> > and understands and accepts the risk of confusion doing so without
>> > explicit "do it now" from the end-user, but may not be a Linux or a
>> > macOS or a Windows expert to know how to write crontab or its
>> > equivalents on various schemes to define scheduled tasks.
>> >
>> > Thanks.
>>
>> Thanks for sharing your thoughts here. For some context — I currently
>> look after git performance for a very large repository (60k+ refs)
>> with a lot of active developers.
>> I've observed that while power git users keep their repository tidy
>> even without runnining maintenance, the majority of users infact don't
>> know (or execute these), and
>> run into performance issues. Easier fleet management was perhaps was
>> one of the motivations behind adding this to `git-maintenance`.
>>
>> While its pretty rare that someone relies on a `refs/remote/*`
>> reference (without creating a local copy of it in `refs/heads`), I can
>> see why it can be surprising and an opt-in
>> should be okay for me.
>>
>> Thanks,
>> Shubham K

If you want to mix a patch with other discussion, write a scissors
line (e.g. "--- >8 --- cut here --- >8 ---" and nothing else on its
own line) here before the patch begins, and tell readers that they
need to scroll 50+ lines before they can see the patch at the
beginning of your message if the leading part needs to be so long.

> From aed935055aff819cce591eecbf6697f8e8e7b99e Mon Sep 17 00:00:00 2001

Never include this line in the body of the message.

> From: Shubham Kanodia <shubham.kanodia10@gmail.com>

Having "From:" in the body of the message at the beginning of your
patch is a good way to override the author identity and set it
differently from what your MUA places on the From: header line,
when you are forwarding somebody else's patch and want to record the
author identity of the real author, but then this line would not
have your name-email.  Another use case is when your MUA cannot be
convinced to place your author identity on its From: header, but the
message I am responding to has the same name-address in the From:
header.

So in conclusion, the above line is unnecessary in the body of this
message.

> Date: Tue, 17 Dec 2024 16:05:16 +0530

Rarely it is a good idea to include this line in the body of the
message.  The Date: header records the time the general public saw
the patch for the first time, which is the author date as far as the
general public is concered, much better.

> Subject: [PATCH] maintenance: add prune-remote-refs task

Overriding "Subject:" header of the e-mail with this in-body header
is almost always needed when you are responding to an ongoing
discussion and including a patch after a scissors line.

So in short, "From/From:/Date:" should be dropped, "Subject:" kept,
and insert a scissors line before it.

> Remote-tracking refs can accumulate in local repositories even as branches
> are deleted on remotes, impacting git performance.

OK.  It usually is better to be explicit about the direction of the
"impact".  E.g., "impacting git performance negatively".

Have you considered using the reftable backend, instead of the files
backend, by the way?  It is rumored to scale much better with the
number of refs.

It is possible that the user deliberately kept stale remote-tracking
branches even after they are removed from the origin because they
used them as the bases of their topic branches, in which case it
would not be a good idea to blindly remove them, but I do not think
that needs to be said here.  These users would not be running
"git remote prune".

> While git fetch --prune
> helps clean these up, it requires a full fetch operation which is expensive
> for large repositories.

It is curious why "git fetch --prune" is mentioned here, especially
when you know "git remote prune" is much more appropriate operation
for that purpose.  The real justification this paragraph needed to
make is why it is a good idea to a new task in "git maintenance",
rather than letting the user run "git remote prune" themselves,
isn't it?

> Add a new maintenance task 'prune-remote-refs' that runs 'git remote prune'
> for each configured remote. This provides an automated way to clean up stale
> remote-tracking refs without requiring full fetches or manual intervention.

Again, "requiring full fetches or" is probably out of place.

> This task is disabled by default.

If we wanted to mention "fetch --prune", it is probably not as an
inferiour alternative to what you invented here, but as a different
way the user can work that makes the problem you are trying to solve
go away.  Perhaps near the beginning, after the "Remote-tracking
refs can accumulate and impact performance negatively", say
something like:

    ... negatively.  The user can have the "fetch.prune"
    configuration variable set to true when interacting with the
    other side, and stale remote-tracking refs will automatically
    pruned at zero cost.

Of course "remote.<name>.prune" is a more targetted way that works
per remote, but your prune-remote-refs works on all remotes,
fetch.prune is a better contrast.

In any case, stepping back a bit, for the population of user who
benefit from enabing the prune-remote-refs task, wouldn't it be an
even better solution for them to set fetch.prune?  You can tell them
to run "git remote prune" just once, set that configuration
variable, and then the remote-tracking branches will stay clean from
then on.  Any future interactions with the remote make sure stale
remote-tracking branches will be removed automatically.  Wouldn't
that be a much better option?  I am sure I must be missing a use
case where fetch.prune (or remote.<name>.prune) is not a good idea
but background prune-remote-refs task works better.

So, if we want to still add this new maintenance task, we should
explain *that* use case after the proposed addition of "you can do
with fetch.prune at zero cost" to the first paragraph.  IOW, before
starting "Add a new maintenance task ...", we need to say something
like:

    There however are cases where pruning stale remote-tracking refs
    at the "fetch" time is undesirable, yet keeping them indefinitely
    is also unacceptable.  <<HERE YOU DESCRIBE THE USE CASE WITH
    SUCH CHARACTERISTICS>>.  To help such use cases, add a new
    maintenance task ...

And if there is no such workflow where a background prune-remote-refs
task is needed and fetch.prune would not work well, then we do not
need a new task, do we?

Let me review the change (assuming that there are cases where
fetch.prune is not a good solution) below.

> +prune-remote-refs::
> + The `prune-remote-refs` task runs `git remote prune` on each remote
> + repository registered in the local repository. This task helps clean
> + up deleted remote branches, improving the performance of operations
> + that iterate through the refs. See linkgit:git-remote[1] for more
> + information. This task is disabled by default and must be enabled
> + manually.
> ++
> +NOTE: This task is opt-in to prevent unexpected removal of remote refs
> +for users of git-maintenance. It is recommended for power users who
> +are comfortable with configuring maintenance tasks and understand the
> +implications of automatic pruning. For others, running `git remote prune`
> +manually might be more appropriate.

"and must be enabled" is unnecessary---that is what "disabled by
default" means.  Also, we probably would want to add a sentence or
two to contrast this with setting fetch.prune to highlight why users
would want to consider enabling this (instead of setting fetch.prune).

Otherwise, nicely written.

>  OPTIONS
>  -------
>  --auto::
> diff --git a/builtin/gc.c b/builtin/gc.c
> index 4ae5196aed..9acf1d2989 100644
> --- a/builtin/gc.c
> +++ b/builtin/gc.c
> @@ -20,6 +20,7 @@
>  #include "lockfile.h"
>  #include "parse-options.h"
>  #include "run-command.h"
> +#include "remote.h"
>  #include "sigchain.h"
>  #include "strvec.h"
>  #include "commit.h"
> @@ -913,6 +914,40 @@ static int maintenance_opt_schedule(const struct
> option *opt, const char *arg,
>   return 0;
>  }
>
> +static int collect_remote(struct remote *remote, void *cb_data)
> +{
> + struct string_list *list = cb_data;
> +
> + if (!remote->url.nr)
> + return 0;
> +
> + string_list_append(list, remote->name);
> + return 0;
> +}
> +
> +static int maintenance_task_prune_remote(struct maintenance_run_opts
> *opts UNUSED,
> + struct gc_config *cfg UNUSED)
> +{
> + struct string_list_item *item;
> + struct string_list remotes_list = STRING_LIST_INIT_NODUP;
> + struct child_process child = CHILD_PROCESS_INIT;
> + int result = 0;
> +
> + for_each_remote(collect_remote, &remotes_list);
> +
> + for_each_string_list_item (item, &remotes_list) {
> + const char *remote_name = item->string;

Is the patch whitespace-damaged?  The indentation seems to be gone.
I won't comment on misindented lines and blocks in this review (i.e.
even if the original had such problems, I wouldn't be seeing them in
the message I am responding to).

> + child.git_cmd = 1;
> + strvec_pushl(&child.args, "remote", "prune", remote_name, NULL);
> +
> + if (run_command(&child))
> + result = error(_("failed to prune '%s'"), remote_name);
> + }
> +
> + string_list_clear(&remotes_list, 0);
> + return result;
> +}

OK.

>  /* Remember to update object flag allocation in object.h */
>  #define SEEN (1u<<0)
>
> @@ -1375,6 +1410,7 @@ enum maintenance_task_label {
>   TASK_GC,
>   TASK_COMMIT_GRAPH,
>   TASK_PACK_REFS,
> + TASK_PRUNE_REMOTE_REFS,
>
>   /* Leave as final value */
>   TASK__COUNT
> @@ -1411,6 +1447,10 @@ static struct maintenance_task tasks[] = {
>   maintenance_task_pack_refs,
>   pack_refs_condition,
>   },
> + [TASK_PRUNE_REMOTE_REFS] = {
> + "prune-remote-refs",
> + maintenance_task_prune_remote,
> + },
>  };
>
>  static int compare_tasks_by_selection(const void *a_, const void *b_)
> @@ -1505,6 +1545,8 @@ static void initialize_maintenance_strategy(void)
>   tasks[TASK_LOOSE_OBJECTS].schedule = SCHEDULE_DAILY;
>   tasks[TASK_PACK_REFS].enabled = 1;
>   tasks[TASK_PACK_REFS].schedule = SCHEDULE_WEEKLY;
> + tasks[TASK_PRUNE_REMOTE_REFS].enabled = 0;
> + tasks[TASK_PRUNE_REMOTE_REFS].schedule = SCHEDULE_DAILY;
>   }
>  }

OK.

> diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
> index 0ce4ba1cbe..60a0c3f835 100755
> --- a/t/t7900-maintenance.sh
> +++ b/t/t7900-maintenance.sh
> @@ -446,6 +446,50 @@ test_expect_success 'pack-refs task' '
>   test_subcommand git pack-refs --all --prune <pack-refs.txt
>  '
>
> +test_expect_success 'prune-remote-refs task not enabled by default' '
> + git clone . prune-test &&
> + (
> + cd prune-test &&
> + GIT_TRACE2_EVENT="$(pwd)/prune.txt" git maintenance run 2>err &&
> + test_subcommand ! git remote prune origin <prune.txt
> + )
> +'

OK.

> +test_expect_success 'prune-remote-refs task cleans stale remote refs' '
> + test_commit initial &&
> +
> + # Create two separate remote repos
> + git clone . remote1 &&
> + git clone . remote2 &&
> +
> + git clone . prune-test-clean &&
> + (
> + cd prune-test-clean &&
> + git config maintenance.prune-remote-refs.enabled true &&
> +
> + # Add both remotes
> + git remote add remote1 "../remote1" &&
> + git remote add remote2 "../remote2" &&
> +
> + # Create and push branches to both remotes
> + git branch -f side2 HEAD &&
> + git push remote1 side2 &&
> + git push remote2 side2 &&
> +
> + # Rename branches in each remote to simulate a stale branch
> + git -C ../remote1 branch -m side2 side3 &&
> + git -C ../remote2 branch -m side2 side4 &&
> +
> + GIT_TRACE2_EVENT="$(pwd)/prune.txt" git maintenance run
> --task=prune-remote-refs &&
> +
> + # Verify pruning happened for both remotes
> + test_subcommand git remote prune remote1 <prune.txt &&
> + test_subcommand git remote prune remote2 <prune.txt &&
> + test_must_fail git rev-parse refs/remotes/remote1/side2 &&
> + test_must_fail git rev-parse refs/remotes/remote2/side2
> + )
> +'

OK.

Other than that it is not clear if we want it, the change looks
cleanly done and was a pleasant read (other than whitespace damages
on the patch text).

Thanks.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Consider adding pruning of refs to git maintenance
  2024-12-17 19:56         ` Junio C Hamano
@ 2024-12-18  8:30           ` Shubham Kanodia
  2024-12-18 15:35             ` Junio C Hamano
  0 siblings, 1 reply; 11+ messages in thread
From: Shubham Kanodia @ 2024-12-18  8:30 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Patrick Steinhardt, git, Derrick Stolee

On Wed, Dec 18, 2024 at 1:26 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Shubham Kanodia <shubham.kanodia10@gmail.com> writes:
>
> > On Tue, Dec 17, 2024 at 4:51 PM Shubham Kanodia
> > <shubham.kanodia10@gmail.com> wrote:
> >>
> >> On Tue, Dec 17, 2024 at 1:11 PM Junio C Hamano <gitster@pobox.com> wrote:
> >> >
> >> > Patrick Steinhardt <ps@pks.im> writes:
> >> >
> >> > > If we want to have such a feature I'd thus claim that it would be most
> >> > > sensible to make it opt-in rather than opt-out. I wouldn't want to be
> >> > > surprised by remote refs vanishing after going to bed, but may be okay
> >> > > with it when I explicitly ask for it.
> >> > >
> >> > > At that point one has to raise the question whether it is still all that
> >> > > useful compared to running `git remote prune` manually every now and
> >> > > then. Mostly because explicitly configuring maintenance is probably
> >> > > something that only power users would do, and those power users would
> >> > > likely know to prune manually.
> >> >
> >> > I am 100% in agreement with your reasoning.  If this thing is to
> >> > exist, it has to be opt-in.  We also need to add ample warning why
> >> > doing this asynchronously behind user's back while the user could be
> >> > working in the same repository is prone to cause surprises in the
> >> > documentation in big red letters.
> >> >
> >> > I however am OK with the idea of having it as an optional "task"
> >> > that is disabled by default in "git maintenance".  "git maintenance"
> >> > can be viewed as a platform-neutral way to set up scheduled tasks.
> >> >
> >> > A user may be a Git expert who knows what `git remote prune` does,
> >> > and understands and accepts the risk of confusion doing so without
> >> > explicit "do it now" from the end-user, but may not be a Linux or a
> >> > macOS or a Windows expert to know how to write crontab or its
> >> > equivalents on various schemes to define scheduled tasks.
> >> >
> >> > Thanks.
> >>
> >> Thanks for sharing your thoughts here. For some context — I currently
> >> look after git performance for a very large repository (60k+ refs)
> >> with a lot of active developers.
> >> I've observed that while power git users keep their repository tidy
> >> even without runnining maintenance, the majority of users infact don't
> >> know (or execute these), and
> >> run into performance issues. Easier fleet management was perhaps was
> >> one of the motivations behind adding this to `git-maintenance`.
> >>
> >> While its pretty rare that someone relies on a `refs/remote/*`
> >> reference (without creating a local copy of it in `refs/heads`), I can
> >> see why it can be surprising and an opt-in
> >> should be okay for me.
> >>
> >> Thanks,
> >> Shubham K
>
> If you want to mix a patch with other discussion, write a scissors
> line (e.g. "--- >8 --- cut here --- >8 ---" and nothing else on its
> own line) here before the patch begins, and tell readers that they
> need to scroll 50+ lines before they can see the patch at the
> beginning of your message if the leading part needs to be so long.
>
> > From aed935055aff819cce591eecbf6697f8e8e7b99e Mon Sep 17 00:00:00 2001
>
> Never include this line in the body of the message.
>
> > From: Shubham Kanodia <shubham.kanodia10@gmail.com>
>
> Having "From:" in the body of the message at the beginning of your
> patch is a good way to override the author identity and set it
> differently from what your MUA places on the From: header line,
> when you are forwarding somebody else's patch and want to record the
> author identity of the real author, but then this line would not
> have your name-email.  Another use case is when your MUA cannot be
> convinced to place your author identity on its From: header, but the
> message I am responding to has the same name-address in the From:
> header.
>
> So in conclusion, the above line is unnecessary in the body of this
> message.
>
> > Date: Tue, 17 Dec 2024 16:05:16 +0530
>
> Rarely it is a good idea to include this line in the body of the
> message.  The Date: header records the time the general public saw
> the patch for the first time, which is the author date as far as the
> general public is concered, much better.
>
> > Subject: [PATCH] maintenance: add prune-remote-refs task
>
> Overriding "Subject:" header of the e-mail with this in-body header
> is almost always needed when you are responding to an ongoing
> discussion and including a patch after a scissors line.
>
> So in short, "From/From:/Date:" should be dropped, "Subject:" kept,
> and insert a scissors line before it.
>
> > Remote-tracking refs can accumulate in local repositories even as branches
> > are deleted on remotes, impacting git performance.
>
> OK.  It usually is better to be explicit about the direction of the
> "impact".  E.g., "impacting git performance negatively".
>
> Have you considered using the reftable backend, instead of the files
> backend, by the way?  It is rumored to scale much better with the
> number of refs.
>
> It is possible that the user deliberately kept stale remote-tracking
> branches even after they are removed from the origin because they
> used them as the bases of their topic branches, in which case it
> would not be a good idea to blindly remove them, but I do not think
> that needs to be said here.  These users would not be running
> "git remote prune".
>
> > While git fetch --prune
> > helps clean these up, it requires a full fetch operation which is expensive
> > for large repositories.
>
> It is curious why "git fetch --prune" is mentioned here, especially
> when you know "git remote prune" is much more appropriate operation
> for that purpose.  The real justification this paragraph needed to
> make is why it is a good idea to a new task in "git maintenance",
> rather than letting the user run "git remote prune" themselves,
> isn't it?
>
> > Add a new maintenance task 'prune-remote-refs' that runs 'git remote prune'
> > for each configured remote. This provides an automated way to clean up stale
> > remote-tracking refs without requiring full fetches or manual intervention.
>
> Again, "requiring full fetches or" is probably out of place.
>
> > This task is disabled by default.
>
> If we wanted to mention "fetch --prune", it is probably not as an
> inferiour alternative to what you invented here, but as a different
> way the user can work that makes the problem you are trying to solve
> go away.  Perhaps near the beginning, after the "Remote-tracking
> refs can accumulate and impact performance negatively", say
> something like:
>
>     ... negatively.  The user can have the "fetch.prune"
>     configuration variable set to true when interacting with the
>     other side, and stale remote-tracking refs will automatically
>     pruned at zero cost.
>
> Of course "remote.<name>.prune" is a more targetted way that works
> per remote, but your prune-remote-refs works on all remotes,
> fetch.prune is a better contrast.
>
> In any case, stepping back a bit, for the population of user who
> benefit from enabing the prune-remote-refs task, wouldn't it be an
> even better solution for them to set fetch.prune?  You can tell them
> to run "git remote prune" just once, set that configuration
> variable, and then the remote-tracking branches will stay clean from
> then on.  Any future interactions with the remote make sure stale
> remote-tracking branches will be removed automatically.  Wouldn't
> that be a much better option?  I am sure I must be missing a use
> case where fetch.prune (or remote.<name>.prune) is not a good idea
> but background prune-remote-refs task works better.
>
> So, if we want to still add this new maintenance task, we should
> explain *that* use case after the proposed addition of "you can do
> with fetch.prune at zero cost" to the first paragraph.  IOW, before
> starting "Add a new maintenance task ...", we need to say something
> like:
>
>     There however are cases where pruning stale remote-tracking refs
>     at the "fetch" time is undesirable, yet keeping them indefinitely
>     is also unacceptable.  <<HERE YOU DESCRIBE THE USE CASE WITH
>     SUCH CHARACTERISTICS>>.  To help such use cases, add a new
>     maintenance task ...
>
> And if there is no such workflow where a background prune-remote-refs
> task is needed and fetch.prune would not work well, then we do not
> need a new task, do we?
>
> Let me review the change (assuming that there are cases where
> fetch.prune is not a good solution) below.
>
> > +prune-remote-refs::
> > + The `prune-remote-refs` task runs `git remote prune` on each remote
> > + repository registered in the local repository. This task helps clean
> > + up deleted remote branches, improving the performance of operations
> > + that iterate through the refs. See linkgit:git-remote[1] for more
> > + information. This task is disabled by default and must be enabled
> > + manually.
> > ++
> > +NOTE: This task is opt-in to prevent unexpected removal of remote refs
> > +for users of git-maintenance. It is recommended for power users who
> > +are comfortable with configuring maintenance tasks and understand the
> > +implications of automatic pruning. For others, running `git remote prune`
> > +manually might be more appropriate.
>
> "and must be enabled" is unnecessary---that is what "disabled by
> default" means.  Also, we probably would want to add a sentence or
> two to contrast this with setting fetch.prune to highlight why users
> would want to consider enabling this (instead of setting fetch.prune).
>
> Otherwise, nicely written.
>
> >  OPTIONS
> >  -------
> >  --auto::
> > diff --git a/builtin/gc.c b/builtin/gc.c
> > index 4ae5196aed..9acf1d2989 100644
> > --- a/builtin/gc.c
> > +++ b/builtin/gc.c
> > @@ -20,6 +20,7 @@
> >  #include "lockfile.h"
> >  #include "parse-options.h"
> >  #include "run-command.h"
> > +#include "remote.h"
> >  #include "sigchain.h"
> >  #include "strvec.h"
> >  #include "commit.h"
> > @@ -913,6 +914,40 @@ static int maintenance_opt_schedule(const struct
> > option *opt, const char *arg,
> >   return 0;
> >  }
> >
> > +static int collect_remote(struct remote *remote, void *cb_data)
> > +{
> > + struct string_list *list = cb_data;
> > +
> > + if (!remote->url.nr)
> > + return 0;
> > +
> > + string_list_append(list, remote->name);
> > + return 0;
> > +}
> > +
> > +static int maintenance_task_prune_remote(struct maintenance_run_opts
> > *opts UNUSED,
> > + struct gc_config *cfg UNUSED)
> > +{
> > + struct string_list_item *item;
> > + struct string_list remotes_list = STRING_LIST_INIT_NODUP;
> > + struct child_process child = CHILD_PROCESS_INIT;
> > + int result = 0;
> > +
> > + for_each_remote(collect_remote, &remotes_list);
> > +
> > + for_each_string_list_item (item, &remotes_list) {
> > + const char *remote_name = item->string;
>
> Is the patch whitespace-damaged?  The indentation seems to be gone.
> I won't comment on misindented lines and blocks in this review (i.e.
> even if the original had such problems, I wouldn't be seeing them in
> the message I am responding to).
>
> > + child.git_cmd = 1;
> > + strvec_pushl(&child.args, "remote", "prune", remote_name, NULL);
> > +
> > + if (run_command(&child))
> > + result = error(_("failed to prune '%s'"), remote_name);
> > + }
> > +
> > + string_list_clear(&remotes_list, 0);
> > + return result;
> > +}
>
> OK.
>
> >  /* Remember to update object flag allocation in object.h */
> >  #define SEEN (1u<<0)
> >
> > @@ -1375,6 +1410,7 @@ enum maintenance_task_label {
> >   TASK_GC,
> >   TASK_COMMIT_GRAPH,
> >   TASK_PACK_REFS,
> > + TASK_PRUNE_REMOTE_REFS,
> >
> >   /* Leave as final value */
> >   TASK__COUNT
> > @@ -1411,6 +1447,10 @@ static struct maintenance_task tasks[] = {
> >   maintenance_task_pack_refs,
> >   pack_refs_condition,
> >   },
> > + [TASK_PRUNE_REMOTE_REFS] = {
> > + "prune-remote-refs",
> > + maintenance_task_prune_remote,
> > + },
> >  };
> >
> >  static int compare_tasks_by_selection(const void *a_, const void *b_)
> > @@ -1505,6 +1545,8 @@ static void initialize_maintenance_strategy(void)
> >   tasks[TASK_LOOSE_OBJECTS].schedule = SCHEDULE_DAILY;
> >   tasks[TASK_PACK_REFS].enabled = 1;
> >   tasks[TASK_PACK_REFS].schedule = SCHEDULE_WEEKLY;
> > + tasks[TASK_PRUNE_REMOTE_REFS].enabled = 0;
> > + tasks[TASK_PRUNE_REMOTE_REFS].schedule = SCHEDULE_DAILY;
> >   }
> >  }
>
> OK.
>
> > diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
> > index 0ce4ba1cbe..60a0c3f835 100755
> > --- a/t/t7900-maintenance.sh
> > +++ b/t/t7900-maintenance.sh
> > @@ -446,6 +446,50 @@ test_expect_success 'pack-refs task' '
> >   test_subcommand git pack-refs --all --prune <pack-refs.txt
> >  '
> >
> > +test_expect_success 'prune-remote-refs task not enabled by default' '
> > + git clone . prune-test &&
> > + (
> > + cd prune-test &&
> > + GIT_TRACE2_EVENT="$(pwd)/prune.txt" git maintenance run 2>err &&
> > + test_subcommand ! git remote prune origin <prune.txt
> > + )
> > +'
>
> OK.
>
> > +test_expect_success 'prune-remote-refs task cleans stale remote refs' '
> > + test_commit initial &&
> > +
> > + # Create two separate remote repos
> > + git clone . remote1 &&
> > + git clone . remote2 &&
> > +
> > + git clone . prune-test-clean &&
> > + (
> > + cd prune-test-clean &&
> > + git config maintenance.prune-remote-refs.enabled true &&
> > +
> > + # Add both remotes
> > + git remote add remote1 "../remote1" &&
> > + git remote add remote2 "../remote2" &&
> > +
> > + # Create and push branches to both remotes
> > + git branch -f side2 HEAD &&
> > + git push remote1 side2 &&
> > + git push remote2 side2 &&
> > +
> > + # Rename branches in each remote to simulate a stale branch
> > + git -C ../remote1 branch -m side2 side3 &&
> > + git -C ../remote2 branch -m side2 side4 &&
> > +
> > + GIT_TRACE2_EVENT="$(pwd)/prune.txt" git maintenance run
> > --task=prune-remote-refs &&
> > +
> > + # Verify pruning happened for both remotes
> > + test_subcommand git remote prune remote1 <prune.txt &&
> > + test_subcommand git remote prune remote2 <prune.txt &&
> > + test_must_fail git rev-parse refs/remotes/remote1/side2 &&
> > + test_must_fail git rev-parse refs/remotes/remote2/side2
> > + )
> > +'
>
> OK.
>
> Other than that it is not clear if we want it, the change looks
> cleanly done and was a pleasant read (other than whitespace damages
> on the patch text).
>
> Thanks.

Let me address a few points here before I resend the patch in a more
ideally formatted manner.

> Have you considered using the reftable backend, instead of the files
> backend, by the way?  It is rumored to scale much better with the
> number of refs.

I don't think we've done that yet, but thanks for the pointer. In any
case, till reftable is stable (including support for reflogs etc),
we'd want to avoid switching all clients to use that,

> In any case, stepping back a bit, for the population of user who
> benefit from enabing the prune-remote-refs task, wouldn't it be an
> even better solution for them to set fetch.prune?  You can tell them
> to run "git remote prune" just once, set that configuration
> variable, and then the remote-tracking branches will stay clean from
> then on.  Any future interactions with the remote make sure stale
> remote-tracking branches will be removed automatically.  Wouldn't
> that be a much better option?  I am sure I must be missing a use
> case where fetch.prune (or remote.<name>.prune) is not a good idea
> but background prune-remote-refs task works better.

Let me expand on the context for suggesting this change:

I work with a large repository that has over 50k refs, with about 4k
new ones added weekly.
We have maintenance scripts on our git server that clean up stale refs
(unused older than N months).

Using `fetch.prune` with a normal git fetch isn't ideal because it
would cause git fetch to unnecessarily download many new refs that
users don't need. So we actively discourage that.

In theory, users could just run `git remote prune` once and carefully
avoid full fetches to keep their local ref count low.
However, in practice, we've found that full fetches happen through
various indirect means:

- Shell plugins like zsh/pure
- Git GUIs like Sourcetree
- Code editors like VSCode

among others.

These tools sometimes perform background fetches without users
realizing it, causing ref counts to gradually increase until users
remember to run `git remote prune` again.

Typically, most users care about:

- Their local branches (refs/heads) - usually a few hundred in number
- A small subset of remote branches they might work with from time to
time (>99% of active remote refs are irrelevant to any individual
developer in a large repo)

The goal here is to help users who can already run `git remote prune`
to do so periodically.
- If full `git fetch` is completely avoided, this will gradually
reduce the local ref count from tens of thousands to just a few
hundred active refs (even if the remote has 50k+ active refs) as old
branches on the remote expire with time.
- Even if not —say, if an errant tool or the developer executes `git
fetch` mistakenly, then the maintenance job ensures this doesn't
become their permanent state until the next manual remote prune.

Ultimately, this is about providing a better default for large-scale
repositories (& users who choose to keep refs pruned).

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Consider adding pruning of refs to git maintenance
  2024-12-18  8:30           ` Shubham Kanodia
@ 2024-12-18 15:35             ` Junio C Hamano
  2024-12-22 12:19               ` Shubham Kanodia
  0 siblings, 1 reply; 11+ messages in thread
From: Junio C Hamano @ 2024-12-18 15:35 UTC (permalink / raw)
  To: Shubham Kanodia; +Cc: Patrick Steinhardt, git, Derrick Stolee

Shubham Kanodia <shubham.kanodia10@gmail.com> writes:

>> ...
>> Thanks.

[administrivia: respond inline, trim out parts that do not have to
be read by bystanders to understand your response].

>> In any case, stepping back a bit, for the population of user who
>> benefit from enabing the prune-remote-refs task, wouldn't it be an
>> even better solution for them to set fetch.prune?  You can tell them
>> to run "git remote prune" just once, set that configuration
>> variable, and then the remote-tracking branches will stay clean from
>> then on.  Any future interactions with the remote make sure stale
>> remote-tracking branches will be removed automatically.  Wouldn't
>> that be a much better option?  I am sure I must be missing a use
>> case where fetch.prune (or remote.<name>.prune) is not a good idea
>> but background prune-remote-refs task works better.
>
> Let me expand on the context for suggesting this change:
>
> I work with a large repository that has over 50k refs, with about 4k
> new ones added weekly.
> We have maintenance scripts on our git server that clean up stale refs
> (unused older than N months).

> Using `fetch.prune` with a normal git fetch isn't ideal because it
> would cause git fetch to unnecessarily download many new refs that
> users don't need. So we actively discourage that.

This is what I did not quite understand.  What do your users
normally do to update their repository from the remote to become in
sync, if they are not running "git fetch"?

	Side note: it is very likely that your users are not
	directly be running "git fetch", but using various
	front-ends like "git pull", "git pull --rebase", or even
	"repo", but they all at some point call "git fetch" to get
	the new objects and update refs.

Ah, are they using "git fetch origin +foo:refs/remotes/origin/foo",
i.e., only selectively fetch the thing that they use and nothing
else (again, their wrappers may supply the refspec to do the
limiting)?  Now it slowly starts to make sense to me (sorry, I am
slow, especially without caffeine in the morning).

Am I following / guessing your set-up more or less correctly so far?

In any case, if your users are doing selective fetching, 50k refs or
4k ref turnover per week on the other side does not really matter.
Your users' desktop repositories won't see remote-tracking refs that
they didn't use and ask for.

But you are right that these selectively fetched refs will
accumulate unless pruned, and fetch.prune would not prune anything
when

	git fetch origin +foo:refs/remotes/origin/foo

because it will not prune what is outside the hierarchy the refspec
covers and this is a deliberate design decision.

For "git fetch origin '+refs/heads/*:refs/remotes/origin/*'", which
is pretty much how "git clone" sets up the remotes, anything we have
in refs/remotes/origin/ hierarchy that do not appear in the current
refs/heads/ hiearchy they have are pruned with fetch.prune=true.
But if you fetch selectively, either 'foo' exists (in which case it
won't be pruned), or 'foo' went away (in which case the fetch itself
fails before even pruning what is on our end), so fetch.prune may
not help.

And at least for a shorter term, periodically running "remote prune"
would be an acceptable workaround for such a workflow.

In the longer term, I suspect that we may want a new option that
lets you more aggressively prune your remote-tracking refs, telling
the tool something like

  git fetch --prune-aggressive origin +refs/heads/foo:refs/remotes/origin/foo

to mean "I only am interested in getting the object to complete
their current 'foo' branch, and get my remote-tracking ref for that
branch updated, BUT if you notice some ref in my refs/remotes/origin/*
that they do not have in their refs/heads/*, please prune it, even when
they are not 'foo' (which means normal --prune would not prune them)",
would not be a terrible idea.

It would be more involved than running "remote prune" periodically,
of course.

> In theory, users could just run `git remote prune` once and carefully
> avoid full fetches to keep their local ref count low.
> However, in practice, we've found that full fetches happen through
> various indirect means:
>
> - Shell plugins like zsh/pure
> - Git GUIs like Sourcetree
> - Code editors like VSCode
>
> among others.

And do any of these bypass underlying "git fetch"?  If not, then one
easier solution is to accept that somebody will do the regular

	refs/heads/*:refs/remotes/origin/*

full-fetch *anyway*.  Once we accept that to happen, we can tell
"git fetch" to always prune.  Then when these "various indirect
means" attempt their full fetch, "git fetch" invoked by them would
still honor fetch.prune, so even though they would try to maintain
these 50k refs in-sync with the remote, which means you may see 4k
new refs per week, but plausibly the refs that got retired are seen
as stale and get removed on the users' repositories.

> - If full `git fetch` is completely avoided, this will gradually
> reduce the local ref count from tens of thousands to just a few
> hundred active refs (even if the remote has 50k+ active refs) as old
> branches on the remote expire with time.

Yes.  You'd somehow need to arrange these third-party tools not to
fetch too much unneeded cruft.

> - Even if not —say, if an errant tool or the developer executes `git
> fetch` mistakenly, then the maintenance job ensures this doesn't
> become their permanent state until the next manual remote prune.

For the latter case, fetch.prune=true would be the ideal solution, I
would imagine.  The "errant tool"'s 'git fetch' would prune the
stale ones.

Now, the documentation should explain when this "periodically running
remote prune" is an acceptable workaround and/or a useful solution,
relative to setting fetch.prune, as most parts of the existing
documentation do assume that the users, intended audience of the
document, are using the bog-standard "git clone" result, that copies
all their branches to remote-tracking branches.

Thanks.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Consider adding pruning of refs to git maintenance
  2024-12-18 15:35             ` Junio C Hamano
@ 2024-12-22 12:19               ` Shubham Kanodia
  2024-12-23  4:21                 ` Junio C Hamano
  0 siblings, 1 reply; 11+ messages in thread
From: Shubham Kanodia @ 2024-12-22 12:19 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Patrick Steinhardt, git, Derrick Stolee

On Wed, Dec 18, 2024 at 9:05 PM Junio C Hamano <gitster@pobox.com> wrote:
...

> Ah, are they using "git fetch origin +foo:refs/remotes/origin/foo",
> i.e., only selectively fetch the thing that they use and nothing
> else (again, their wrappers may supply the refspec to do the
> limiting)?  Now it slowly starts to make sense to me (sorry, I am
> slow, especially without caffeine in the morning).
>
> Am I following / guessing your set-up more or less correctly so far?

Yes. The root cause of the issue here is that the behaviour that `git
fetch` / `git pull` fetches all refs by default is undesirable in
large git repositories.
It's almost never what you want to do. We advice users to execute `git
fetch <remote> <ref>` when they want to run fetch (/pull) explicitly.

Ideally, `git fetch` would only fetch the current branch when an
explicit branch is not specified, and `git fetch --all` would pull in
all.
Now I believe git does provide a way to configure the default fetch
refs via `remote.<name>.fetch`. So in theory I could just set —

```
[remote "origin"]
      fetch = +refs/heads/master:refs/remotes/origin/master
```

which would avoid someone pulling in all refs accidentally.
However that has a side effect that now if you do want to fetch &
start working on a remote ref that you weren't previously tracking, a
command like

```
git fetch origin new-ref-branch-from-remote
```

no longer allows you to just start working on this new branch by doing
a `git checkout new-ref-branch-from-remote`.

If you wanted to be able to do that, you'd probably need to do —

```
    git fetch origin new-ref-branch-from-remote
--refmap=+refs/heads/new-ref-branch-from-remote:refs/remotes/origin/new-ref-branch-from-remote
```

which is pretty awkward to type everytime.

Now to come back from this little digression, for now —
- We ask users to set both `fetch.prune` and `fetch.pruneTags` to true
(so that if a third party tool does do something, the damage is
limited and they don't have an ever-growing list of refs)
- Setup this job that cleans stale remote refs on a periodic basis,
which means that their ref counts heal over time (if they configure
all third party tools right)


> Now, the documentation should explain when this "periodically running
> remote prune" is an acceptable workaround and/or a useful solution,
> relative to setting fetch.prune, as most parts of the existing
> documentation do assume that the users, intended audience of the
> document, are using the bog-standard "git clone" result, that copies
> all their branches to remote-tracking branches.

Agreed, will update docs to include that.

Thanks,
Shubham K

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Consider adding pruning of refs to git maintenance
  2024-12-22 12:19               ` Shubham Kanodia
@ 2024-12-23  4:21                 ` Junio C Hamano
  2024-12-23  9:30                   ` Shubham Kanodia
  0 siblings, 1 reply; 11+ messages in thread
From: Junio C Hamano @ 2024-12-23  4:21 UTC (permalink / raw)
  To: Shubham Kanodia; +Cc: Patrick Steinhardt, git, Derrick Stolee

Shubham Kanodia <shubham.kanodia10@gmail.com> writes:

> If you wanted to be able to do that, you'd probably need to do —
>
> ```
>     git fetch origin new-ref-branch-from-remote
> --refmap=+refs/heads/new-ref-branch-from-remote:refs/remotes/origin/new-ref-branch-from-remote
> ```
>
> which is pretty awkward to type everytime.

Or have this in your .git/config

    [remote "origin"]
	fetch = +refs/heads/new-ref-branch-from-remote:refs/remotes/origin/new-ref-branch-from-remote

and run

    $ git fetch origin new-ref-branch-from-remote

If you plan to do the same for more random branches, you could
instead do

    [remote "origin"]
	fetch = +refs/heads/*:refs/remotes/origin/*

as long as you promise that you never run "git fetch" without saying
which exact refs to fetch.  I wonder if it helps to introduce a new
configuration remote.*.fetchmap

    [remote "origin"]
	fetchmap = +refs/heads/*:refs/remotes/origin/*

that only talks about how the mapping goes without saying what gets
fetched by default, so that the lack of remote.origin.fetch would
cause

    $ git fetch origin
    $ git fetch

not to grab everything (like when you had the same specification as
the value of remote.origin.fetch instead), but allows you to map
whatever you grab from there when you did

    $ git fetch origin foo bar

If the glob pattern given by the fetchmap participated in how
fetch.prune behaves (i.e. I see refs/remotes/origin/baz here, but
while trying to fetch refs/heads/{foo,bar}, I notice that they no
longer advertise refs/heads/baz---let me prune that stale one),
that would make it much simpler.

> Now to come back from this little digression, for now —
> - We ask users to set both `fetch.prune` and `fetch.pruneTags` to true
> (so that if a third party tool does do something, the damage is
> limited and they don't have an ever-growing list of refs)

Good.

> - Setup this job that cleans stale remote refs on a periodic basis,
> which means that their ref counts heal over time (if they configure
> all third party tools right)

OK.  And a scheduled task would make this part better.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Consider adding pruning of refs to git maintenance
  2024-12-23  4:21                 ` Junio C Hamano
@ 2024-12-23  9:30                   ` Shubham Kanodia
  0 siblings, 0 replies; 11+ messages in thread
From: Shubham Kanodia @ 2024-12-23  9:30 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Patrick Steinhardt, git, Derrick Stolee

On Mon, Dec 23, 2024 at 9:51 AM Junio C Hamano <gitster@pobox.com> wrote:

> as long as you promise that you never run "git fetch" without saying
> which exact refs to fetch.  I wonder if it helps to introduce a new
> configuration remote.*.fetchmap
>
>     [remote "origin"]
>         fetchmap = +refs/heads/*:refs/remotes/origin/*
>
> that only talks about how the mapping goes without saying what gets
> fetched by default, so that the lack of remote.origin.fetch would
> cause
>
>     $ git fetch origin
>     $ git fetch
>
> not to grab everything (like when you had the same specification as
> the value of remote.origin.fetch instead), but allows you to map
> whatever you grab from there when you did

This would be perfect really. Adding a new refmap every time you need
to pull in changes from someone else's branch is too cumbersome.
I'd be happy to attempt this as a follow up patch where
`remote.<remote>.fetchmap` if you think that's possible to add.

Also to keep things cleaner — I'm going to re-submit this patch using
gitgitgadget separately.

Thanks!
Shubham K

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2024-12-23  9:31 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-12-16 12:33 Consider adding pruning of refs to git maintenance Shubham Kanodia
2024-12-17  7:21 ` Patrick Steinhardt
2024-12-17  7:41   ` Junio C Hamano
2024-12-17 11:21     ` Shubham Kanodia
2024-12-17 11:24       ` Shubham Kanodia
2024-12-17 19:56         ` Junio C Hamano
2024-12-18  8:30           ` Shubham Kanodia
2024-12-18 15:35             ` Junio C Hamano
2024-12-22 12:19               ` Shubham Kanodia
2024-12-23  4:21                 ` Junio C Hamano
2024-12-23  9:30                   ` Shubham Kanodia

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).