* Consider adding pruning of refs to git maintenance @ 2024-12-16 12:33 Shubham Kanodia 2024-12-17 7:21 ` Patrick Steinhardt 0 siblings, 1 reply; 11+ messages in thread From: Shubham Kanodia @ 2024-12-16 12:33 UTC (permalink / raw) To: git, Patrick Steinhardt Remote-tracking refs accumulate quickly in large repositories as users merge and delete their branches. While these branches are cleaned up on the remote, local repositories may retain stale references to deleted branches unless explicitly pruned. The number of local refs can have an impact on git performance of several commands. Git currently provides two ways for orphan local refs to be cleaned up — 1. Automated: `fetch.prune` and `fetch.pruneTags` configurations with `git fetch/pull` 2. Manual: `git remote prune` However, both approaches have issues: - Full `git fetch/pull` operations are expensive on large repositories, pulling thousands of irrelevant refs - Manual `git remote prune` requires user intervention Proposal: Add remote pruning to the daily `git-maintenance` task. This would clean stale refs automatically without requiring full fetches or manual intervention. This is especially useful for users who historically pulled all refs/tags but now use targeted fetches. Moreover, it decouples the cleanup action (pruning) from the action to fetch more refs. Happy to submit on a patch for the same unless there's something obvious that I've missed here. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Consider adding pruning of refs to git maintenance 2024-12-16 12:33 Consider adding pruning of refs to git maintenance Shubham Kanodia @ 2024-12-17 7:21 ` Patrick Steinhardt 2024-12-17 7:41 ` Junio C Hamano 0 siblings, 1 reply; 11+ messages in thread From: Patrick Steinhardt @ 2024-12-17 7:21 UTC (permalink / raw) To: Shubham Kanodia; +Cc: git, Derrick Stolee On Mon, Dec 16, 2024 at 06:03:03PM +0530, Shubham Kanodia wrote: > Remote-tracking refs accumulate quickly in large repositories as users > merge and delete their branches. While these branches are cleaned up > on the remote, local repositories may retain stale references to > deleted branches unless explicitly pruned. The number of local refs > can have an impact on git performance of several commands. > > Git currently provides two ways for orphan local refs to be cleaned up — > 1. Automated: `fetch.prune` and `fetch.pruneTags` configurations with > `git fetch/pull` > 2. Manual: `git remote prune` > > However, both approaches have issues: > - Full `git fetch/pull` operations are expensive on large > repositories, pulling thousands of irrelevant refs > - Manual `git remote prune` requires user intervention Fair. Neither of those issues feel insurmountable, but I can see why it could make our users lifes easier. > Proposal: > Add remote pruning to the daily `git-maintenance` task. This would > clean stale refs automatically without requiring full fetches or > manual intervention. > > This is especially useful for users who historically pulled all > refs/tags but now use targeted fetches. Moreover, it decouples the > cleanup action (pruning) from the action to fetch more refs. I think we need to consider a couple of things: - It's somewhat awkward to have maintenance jobs that interact with a remote, as that may not work in contexts where you actually need to authenticate. But there is precedent with the "prefetch" task, so we have already opened that can of worms. - Maintenance tries to be as non-destructive as reasonably possible, and deleting refs certainly is a destructive operation. - We try to avoid bad interactions with a user that works concurrently in the repo that git-maintenance(1) runs in. This is the reason why the "prefetch" task does not fetch into `refs/remotes`, but into a separate ref namespace. If we want to have such a feature I'd thus claim that it would be most sensible to make it opt-in rather than opt-out. I wouldn't want to be surprised by remote refs vanishing after going to bed, but may be okay with it when I explicitly ask for it. At that point one has to raise the question whether it is still all that useful compared to running `git remote prune` manually every now and then. Mostly because explicitly configuring maintenance is probably something that only power users would do, and those power users would likely know to prune manually. In any case, that's just my 2c. I can see a usecase for your feature, but think we should be careful with how it is introduced. > Happy to submit on a patch for the same unless there's something > obvious that I've missed here. I'm happy to have a look in case you decide to implement this feature. Patrick ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Consider adding pruning of refs to git maintenance 2024-12-17 7:21 ` Patrick Steinhardt @ 2024-12-17 7:41 ` Junio C Hamano 2024-12-17 11:21 ` Shubham Kanodia 0 siblings, 1 reply; 11+ messages in thread From: Junio C Hamano @ 2024-12-17 7:41 UTC (permalink / raw) To: Patrick Steinhardt; +Cc: Shubham Kanodia, git, Derrick Stolee Patrick Steinhardt <ps@pks.im> writes: > If we want to have such a feature I'd thus claim that it would be most > sensible to make it opt-in rather than opt-out. I wouldn't want to be > surprised by remote refs vanishing after going to bed, but may be okay > with it when I explicitly ask for it. > > At that point one has to raise the question whether it is still all that > useful compared to running `git remote prune` manually every now and > then. Mostly because explicitly configuring maintenance is probably > something that only power users would do, and those power users would > likely know to prune manually. I am 100% in agreement with your reasoning. If this thing is to exist, it has to be opt-in. We also need to add ample warning why doing this asynchronously behind user's back while the user could be working in the same repository is prone to cause surprises in the documentation in big red letters. I however am OK with the idea of having it as an optional "task" that is disabled by default in "git maintenance". "git maintenance" can be viewed as a platform-neutral way to set up scheduled tasks. A user may be a Git expert who knows what `git remote prune` does, and understands and accepts the risk of confusion doing so without explicit "do it now" from the end-user, but may not be a Linux or a macOS or a Windows expert to know how to write crontab or its equivalents on various schemes to define scheduled tasks. Thanks. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Consider adding pruning of refs to git maintenance 2024-12-17 7:41 ` Junio C Hamano @ 2024-12-17 11:21 ` Shubham Kanodia 2024-12-17 11:24 ` Shubham Kanodia 0 siblings, 1 reply; 11+ messages in thread From: Shubham Kanodia @ 2024-12-17 11:21 UTC (permalink / raw) To: Junio C Hamano; +Cc: Patrick Steinhardt, git, Derrick Stolee On Tue, Dec 17, 2024 at 1:11 PM Junio C Hamano <gitster@pobox.com> wrote: > > Patrick Steinhardt <ps@pks.im> writes: > > > If we want to have such a feature I'd thus claim that it would be most > > sensible to make it opt-in rather than opt-out. I wouldn't want to be > > surprised by remote refs vanishing after going to bed, but may be okay > > with it when I explicitly ask for it. > > > > At that point one has to raise the question whether it is still all that > > useful compared to running `git remote prune` manually every now and > > then. Mostly because explicitly configuring maintenance is probably > > something that only power users would do, and those power users would > > likely know to prune manually. > > I am 100% in agreement with your reasoning. If this thing is to > exist, it has to be opt-in. We also need to add ample warning why > doing this asynchronously behind user's back while the user could be > working in the same repository is prone to cause surprises in the > documentation in big red letters. > > I however am OK with the idea of having it as an optional "task" > that is disabled by default in "git maintenance". "git maintenance" > can be viewed as a platform-neutral way to set up scheduled tasks. > > A user may be a Git expert who knows what `git remote prune` does, > and understands and accepts the risk of confusion doing so without > explicit "do it now" from the end-user, but may not be a Linux or a > macOS or a Windows expert to know how to write crontab or its > equivalents on various schemes to define scheduled tasks. > > Thanks. Thanks for sharing your thoughts here. For some context — I currently look after git performance for a very large repository (60k+ refs) with a lot of active developers. I've observed that while power git users keep their repository tidy even without runnining maintenance, the majority of users infact don't know (or execute these), and run into performance issues. Easier fleet management was perhaps was one of the motivations behind adding this to `git-maintenance`. While its pretty rare that someone relies on a `refs/remote/*` reference (without creating a local copy of it in `refs/heads`), I can see why it can be surprising and an opt-in should be okay for me. Thanks, Shubham K ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Consider adding pruning of refs to git maintenance 2024-12-17 11:21 ` Shubham Kanodia @ 2024-12-17 11:24 ` Shubham Kanodia 2024-12-17 19:56 ` Junio C Hamano 0 siblings, 1 reply; 11+ messages in thread From: Shubham Kanodia @ 2024-12-17 11:24 UTC (permalink / raw) To: Junio C Hamano; +Cc: Patrick Steinhardt, git, Derrick Stolee On Tue, Dec 17, 2024 at 4:51 PM Shubham Kanodia <shubham.kanodia10@gmail.com> wrote: > > On Tue, Dec 17, 2024 at 1:11 PM Junio C Hamano <gitster@pobox.com> wrote: > > > > Patrick Steinhardt <ps@pks.im> writes: > > > > > If we want to have such a feature I'd thus claim that it would be most > > > sensible to make it opt-in rather than opt-out. I wouldn't want to be > > > surprised by remote refs vanishing after going to bed, but may be okay > > > with it when I explicitly ask for it. > > > > > > At that point one has to raise the question whether it is still all that > > > useful compared to running `git remote prune` manually every now and > > > then. Mostly because explicitly configuring maintenance is probably > > > something that only power users would do, and those power users would > > > likely know to prune manually. > > > > I am 100% in agreement with your reasoning. If this thing is to > > exist, it has to be opt-in. We also need to add ample warning why > > doing this asynchronously behind user's back while the user could be > > working in the same repository is prone to cause surprises in the > > documentation in big red letters. > > > > I however am OK with the idea of having it as an optional "task" > > that is disabled by default in "git maintenance". "git maintenance" > > can be viewed as a platform-neutral way to set up scheduled tasks. > > > > A user may be a Git expert who knows what `git remote prune` does, > > and understands and accepts the risk of confusion doing so without > > explicit "do it now" from the end-user, but may not be a Linux or a > > macOS or a Windows expert to know how to write crontab or its > > equivalents on various schemes to define scheduled tasks. > > > > Thanks. > > Thanks for sharing your thoughts here. For some context — I currently > look after git performance for a very large repository (60k+ refs) > with a lot of active developers. > I've observed that while power git users keep their repository tidy > even without runnining maintenance, the majority of users infact don't > know (or execute these), and > run into performance issues. Easier fleet management was perhaps was > one of the motivations behind adding this to `git-maintenance`. > > While its pretty rare that someone relies on a `refs/remote/*` > reference (without creating a local copy of it in `refs/heads`), I can > see why it can be surprising and an opt-in > should be okay for me. > > Thanks, > Shubham K From aed935055aff819cce591eecbf6697f8e8e7b99e Mon Sep 17 00:00:00 2001 From: Shubham Kanodia <shubham.kanodia10@gmail.com> Date: Tue, 17 Dec 2024 16:05:16 +0530 Subject: [PATCH] maintenance: add prune-remote-refs task Remote-tracking refs can accumulate in local repositories even as branches are deleted on remotes, impacting git performance. While git fetch --prune helps clean these up, it requires a full fetch operation which is expensive for large repositories. Add a new maintenance task 'prune-remote-refs' that runs 'git remote prune' for each configured remote. This provides an automated way to clean up stale remote-tracking refs without requiring full fetches or manual intervention. This task is disabled by default. Signed-off-by: Shubham Kanodia <shubham.kanodia10@gmail.com> --- Documentation/git-maintenance.txt | 14 ++++++++++ builtin/gc.c | 42 +++++++++++++++++++++++++++++ t/t7900-maintenance.sh | 44 +++++++++++++++++++++++++++++++ 3 files changed, 100 insertions(+) diff --git a/Documentation/git-maintenance.txt b/Documentation/git-maintenance.txt index 6e6651309d..3e0ca84de2 100644 --- a/Documentation/git-maintenance.txt +++ b/Documentation/git-maintenance.txt @@ -158,6 +158,20 @@ pack-refs:: need to iterate across many references. See linkgit:git-pack-refs[1] for more information. +prune-remote-refs:: + The `prune-remote-refs` task runs `git remote prune` on each remote + repository registered in the local repository. This task helps clean + up deleted remote branches, improving the performance of operations + that iterate through the refs. See linkgit:git-remote[1] for more + information. This task is disabled by default and must be enabled + manually. ++ +NOTE: This task is opt-in to prevent unexpected removal of remote refs +for users of git-maintenance. It is recommended for power users who +are comfortable with configuring maintenance tasks and understand the +implications of automatic pruning. For others, running `git remote prune` +manually might be more appropriate. + OPTIONS ------- --auto:: diff --git a/builtin/gc.c b/builtin/gc.c index 4ae5196aed..9acf1d2989 100644 --- a/builtin/gc.c +++ b/builtin/gc.c @@ -20,6 +20,7 @@ #include "lockfile.h" #include "parse-options.h" #include "run-command.h" +#include "remote.h" #include "sigchain.h" #include "strvec.h" #include "commit.h" @@ -913,6 +914,40 @@ static int maintenance_opt_schedule(const struct option *opt, const char *arg, return 0; } +static int collect_remote(struct remote *remote, void *cb_data) +{ + struct string_list *list = cb_data; + + if (!remote->url.nr) + return 0; + + string_list_append(list, remote->name); + return 0; +} + +static int maintenance_task_prune_remote(struct maintenance_run_opts *opts UNUSED, + struct gc_config *cfg UNUSED) +{ + struct string_list_item *item; + struct string_list remotes_list = STRING_LIST_INIT_NODUP; + struct child_process child = CHILD_PROCESS_INIT; + int result = 0; + + for_each_remote(collect_remote, &remotes_list); + + for_each_string_list_item (item, &remotes_list) { + const char *remote_name = item->string; + child.git_cmd = 1; + strvec_pushl(&child.args, "remote", "prune", remote_name, NULL); + + if (run_command(&child)) + result = error(_("failed to prune '%s'"), remote_name); + } + + string_list_clear(&remotes_list, 0); + return result; +} + /* Remember to update object flag allocation in object.h */ #define SEEN (1u<<0) @@ -1375,6 +1410,7 @@ enum maintenance_task_label { TASK_GC, TASK_COMMIT_GRAPH, TASK_PACK_REFS, + TASK_PRUNE_REMOTE_REFS, /* Leave as final value */ TASK__COUNT @@ -1411,6 +1447,10 @@ static struct maintenance_task tasks[] = { maintenance_task_pack_refs, pack_refs_condition, }, + [TASK_PRUNE_REMOTE_REFS] = { + "prune-remote-refs", + maintenance_task_prune_remote, + }, }; static int compare_tasks_by_selection(const void *a_, const void *b_) @@ -1505,6 +1545,8 @@ static void initialize_maintenance_strategy(void) tasks[TASK_LOOSE_OBJECTS].schedule = SCHEDULE_DAILY; tasks[TASK_PACK_REFS].enabled = 1; tasks[TASK_PACK_REFS].schedule = SCHEDULE_WEEKLY; + tasks[TASK_PRUNE_REMOTE_REFS].enabled = 0; + tasks[TASK_PRUNE_REMOTE_REFS].schedule = SCHEDULE_DAILY; } } diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh index 0ce4ba1cbe..60a0c3f835 100755 --- a/t/t7900-maintenance.sh +++ b/t/t7900-maintenance.sh @@ -446,6 +446,50 @@ test_expect_success 'pack-refs task' ' test_subcommand git pack-refs --all --prune <pack-refs.txt ' +test_expect_success 'prune-remote-refs task not enabled by default' ' + git clone . prune-test && + ( + cd prune-test && + GIT_TRACE2_EVENT="$(pwd)/prune.txt" git maintenance run 2>err && + test_subcommand ! git remote prune origin <prune.txt + ) +' + +test_expect_success 'prune-remote-refs task cleans stale remote refs' ' + test_commit initial && + + # Create two separate remote repos + git clone . remote1 && + git clone . remote2 && + + git clone . prune-test-clean && + ( + cd prune-test-clean && + git config maintenance.prune-remote-refs.enabled true && + + # Add both remotes + git remote add remote1 "../remote1" && + git remote add remote2 "../remote2" && + + # Create and push branches to both remotes + git branch -f side2 HEAD && + git push remote1 side2 && + git push remote2 side2 && + + # Rename branches in each remote to simulate a stale branch + git -C ../remote1 branch -m side2 side3 && + git -C ../remote2 branch -m side2 side4 && + + GIT_TRACE2_EVENT="$(pwd)/prune.txt" git maintenance run --task=prune-remote-refs && + + # Verify pruning happened for both remotes + test_subcommand git remote prune remote1 <prune.txt && + test_subcommand git remote prune remote2 <prune.txt && + test_must_fail git rev-parse refs/remotes/remote1/side2 && + test_must_fail git rev-parse refs/remotes/remote2/side2 + ) +' + test_expect_success '--auto and --schedule incompatible' ' test_must_fail git maintenance run --auto --schedule=daily 2>err && test_grep "at most one" err -- 2.46.2 ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: Consider adding pruning of refs to git maintenance 2024-12-17 11:24 ` Shubham Kanodia @ 2024-12-17 19:56 ` Junio C Hamano 2024-12-18 8:30 ` Shubham Kanodia 0 siblings, 1 reply; 11+ messages in thread From: Junio C Hamano @ 2024-12-17 19:56 UTC (permalink / raw) To: Shubham Kanodia; +Cc: Patrick Steinhardt, git, Derrick Stolee Shubham Kanodia <shubham.kanodia10@gmail.com> writes: > On Tue, Dec 17, 2024 at 4:51 PM Shubham Kanodia > <shubham.kanodia10@gmail.com> wrote: >> >> On Tue, Dec 17, 2024 at 1:11 PM Junio C Hamano <gitster@pobox.com> wrote: >> > >> > Patrick Steinhardt <ps@pks.im> writes: >> > >> > > If we want to have such a feature I'd thus claim that it would be most >> > > sensible to make it opt-in rather than opt-out. I wouldn't want to be >> > > surprised by remote refs vanishing after going to bed, but may be okay >> > > with it when I explicitly ask for it. >> > > >> > > At that point one has to raise the question whether it is still all that >> > > useful compared to running `git remote prune` manually every now and >> > > then. Mostly because explicitly configuring maintenance is probably >> > > something that only power users would do, and those power users would >> > > likely know to prune manually. >> > >> > I am 100% in agreement with your reasoning. If this thing is to >> > exist, it has to be opt-in. We also need to add ample warning why >> > doing this asynchronously behind user's back while the user could be >> > working in the same repository is prone to cause surprises in the >> > documentation in big red letters. >> > >> > I however am OK with the idea of having it as an optional "task" >> > that is disabled by default in "git maintenance". "git maintenance" >> > can be viewed as a platform-neutral way to set up scheduled tasks. >> > >> > A user may be a Git expert who knows what `git remote prune` does, >> > and understands and accepts the risk of confusion doing so without >> > explicit "do it now" from the end-user, but may not be a Linux or a >> > macOS or a Windows expert to know how to write crontab or its >> > equivalents on various schemes to define scheduled tasks. >> > >> > Thanks. >> >> Thanks for sharing your thoughts here. For some context — I currently >> look after git performance for a very large repository (60k+ refs) >> with a lot of active developers. >> I've observed that while power git users keep their repository tidy >> even without runnining maintenance, the majority of users infact don't >> know (or execute these), and >> run into performance issues. Easier fleet management was perhaps was >> one of the motivations behind adding this to `git-maintenance`. >> >> While its pretty rare that someone relies on a `refs/remote/*` >> reference (without creating a local copy of it in `refs/heads`), I can >> see why it can be surprising and an opt-in >> should be okay for me. >> >> Thanks, >> Shubham K If you want to mix a patch with other discussion, write a scissors line (e.g. "--- >8 --- cut here --- >8 ---" and nothing else on its own line) here before the patch begins, and tell readers that they need to scroll 50+ lines before they can see the patch at the beginning of your message if the leading part needs to be so long. > From aed935055aff819cce591eecbf6697f8e8e7b99e Mon Sep 17 00:00:00 2001 Never include this line in the body of the message. > From: Shubham Kanodia <shubham.kanodia10@gmail.com> Having "From:" in the body of the message at the beginning of your patch is a good way to override the author identity and set it differently from what your MUA places on the From: header line, when you are forwarding somebody else's patch and want to record the author identity of the real author, but then this line would not have your name-email. Another use case is when your MUA cannot be convinced to place your author identity on its From: header, but the message I am responding to has the same name-address in the From: header. So in conclusion, the above line is unnecessary in the body of this message. > Date: Tue, 17 Dec 2024 16:05:16 +0530 Rarely it is a good idea to include this line in the body of the message. The Date: header records the time the general public saw the patch for the first time, which is the author date as far as the general public is concered, much better. > Subject: [PATCH] maintenance: add prune-remote-refs task Overriding "Subject:" header of the e-mail with this in-body header is almost always needed when you are responding to an ongoing discussion and including a patch after a scissors line. So in short, "From/From:/Date:" should be dropped, "Subject:" kept, and insert a scissors line before it. > Remote-tracking refs can accumulate in local repositories even as branches > are deleted on remotes, impacting git performance. OK. It usually is better to be explicit about the direction of the "impact". E.g., "impacting git performance negatively". Have you considered using the reftable backend, instead of the files backend, by the way? It is rumored to scale much better with the number of refs. It is possible that the user deliberately kept stale remote-tracking branches even after they are removed from the origin because they used them as the bases of their topic branches, in which case it would not be a good idea to blindly remove them, but I do not think that needs to be said here. These users would not be running "git remote prune". > While git fetch --prune > helps clean these up, it requires a full fetch operation which is expensive > for large repositories. It is curious why "git fetch --prune" is mentioned here, especially when you know "git remote prune" is much more appropriate operation for that purpose. The real justification this paragraph needed to make is why it is a good idea to a new task in "git maintenance", rather than letting the user run "git remote prune" themselves, isn't it? > Add a new maintenance task 'prune-remote-refs' that runs 'git remote prune' > for each configured remote. This provides an automated way to clean up stale > remote-tracking refs without requiring full fetches or manual intervention. Again, "requiring full fetches or" is probably out of place. > This task is disabled by default. If we wanted to mention "fetch --prune", it is probably not as an inferiour alternative to what you invented here, but as a different way the user can work that makes the problem you are trying to solve go away. Perhaps near the beginning, after the "Remote-tracking refs can accumulate and impact performance negatively", say something like: ... negatively. The user can have the "fetch.prune" configuration variable set to true when interacting with the other side, and stale remote-tracking refs will automatically pruned at zero cost. Of course "remote.<name>.prune" is a more targetted way that works per remote, but your prune-remote-refs works on all remotes, fetch.prune is a better contrast. In any case, stepping back a bit, for the population of user who benefit from enabing the prune-remote-refs task, wouldn't it be an even better solution for them to set fetch.prune? You can tell them to run "git remote prune" just once, set that configuration variable, and then the remote-tracking branches will stay clean from then on. Any future interactions with the remote make sure stale remote-tracking branches will be removed automatically. Wouldn't that be a much better option? I am sure I must be missing a use case where fetch.prune (or remote.<name>.prune) is not a good idea but background prune-remote-refs task works better. So, if we want to still add this new maintenance task, we should explain *that* use case after the proposed addition of "you can do with fetch.prune at zero cost" to the first paragraph. IOW, before starting "Add a new maintenance task ...", we need to say something like: There however are cases where pruning stale remote-tracking refs at the "fetch" time is undesirable, yet keeping them indefinitely is also unacceptable. <<HERE YOU DESCRIBE THE USE CASE WITH SUCH CHARACTERISTICS>>. To help such use cases, add a new maintenance task ... And if there is no such workflow where a background prune-remote-refs task is needed and fetch.prune would not work well, then we do not need a new task, do we? Let me review the change (assuming that there are cases where fetch.prune is not a good solution) below. > +prune-remote-refs:: > + The `prune-remote-refs` task runs `git remote prune` on each remote > + repository registered in the local repository. This task helps clean > + up deleted remote branches, improving the performance of operations > + that iterate through the refs. See linkgit:git-remote[1] for more > + information. This task is disabled by default and must be enabled > + manually. > ++ > +NOTE: This task is opt-in to prevent unexpected removal of remote refs > +for users of git-maintenance. It is recommended for power users who > +are comfortable with configuring maintenance tasks and understand the > +implications of automatic pruning. For others, running `git remote prune` > +manually might be more appropriate. "and must be enabled" is unnecessary---that is what "disabled by default" means. Also, we probably would want to add a sentence or two to contrast this with setting fetch.prune to highlight why users would want to consider enabling this (instead of setting fetch.prune). Otherwise, nicely written. > OPTIONS > ------- > --auto:: > diff --git a/builtin/gc.c b/builtin/gc.c > index 4ae5196aed..9acf1d2989 100644 > --- a/builtin/gc.c > +++ b/builtin/gc.c > @@ -20,6 +20,7 @@ > #include "lockfile.h" > #include "parse-options.h" > #include "run-command.h" > +#include "remote.h" > #include "sigchain.h" > #include "strvec.h" > #include "commit.h" > @@ -913,6 +914,40 @@ static int maintenance_opt_schedule(const struct > option *opt, const char *arg, > return 0; > } > > +static int collect_remote(struct remote *remote, void *cb_data) > +{ > + struct string_list *list = cb_data; > + > + if (!remote->url.nr) > + return 0; > + > + string_list_append(list, remote->name); > + return 0; > +} > + > +static int maintenance_task_prune_remote(struct maintenance_run_opts > *opts UNUSED, > + struct gc_config *cfg UNUSED) > +{ > + struct string_list_item *item; > + struct string_list remotes_list = STRING_LIST_INIT_NODUP; > + struct child_process child = CHILD_PROCESS_INIT; > + int result = 0; > + > + for_each_remote(collect_remote, &remotes_list); > + > + for_each_string_list_item (item, &remotes_list) { > + const char *remote_name = item->string; Is the patch whitespace-damaged? The indentation seems to be gone. I won't comment on misindented lines and blocks in this review (i.e. even if the original had such problems, I wouldn't be seeing them in the message I am responding to). > + child.git_cmd = 1; > + strvec_pushl(&child.args, "remote", "prune", remote_name, NULL); > + > + if (run_command(&child)) > + result = error(_("failed to prune '%s'"), remote_name); > + } > + > + string_list_clear(&remotes_list, 0); > + return result; > +} OK. > /* Remember to update object flag allocation in object.h */ > #define SEEN (1u<<0) > > @@ -1375,6 +1410,7 @@ enum maintenance_task_label { > TASK_GC, > TASK_COMMIT_GRAPH, > TASK_PACK_REFS, > + TASK_PRUNE_REMOTE_REFS, > > /* Leave as final value */ > TASK__COUNT > @@ -1411,6 +1447,10 @@ static struct maintenance_task tasks[] = { > maintenance_task_pack_refs, > pack_refs_condition, > }, > + [TASK_PRUNE_REMOTE_REFS] = { > + "prune-remote-refs", > + maintenance_task_prune_remote, > + }, > }; > > static int compare_tasks_by_selection(const void *a_, const void *b_) > @@ -1505,6 +1545,8 @@ static void initialize_maintenance_strategy(void) > tasks[TASK_LOOSE_OBJECTS].schedule = SCHEDULE_DAILY; > tasks[TASK_PACK_REFS].enabled = 1; > tasks[TASK_PACK_REFS].schedule = SCHEDULE_WEEKLY; > + tasks[TASK_PRUNE_REMOTE_REFS].enabled = 0; > + tasks[TASK_PRUNE_REMOTE_REFS].schedule = SCHEDULE_DAILY; > } > } OK. > diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh > index 0ce4ba1cbe..60a0c3f835 100755 > --- a/t/t7900-maintenance.sh > +++ b/t/t7900-maintenance.sh > @@ -446,6 +446,50 @@ test_expect_success 'pack-refs task' ' > test_subcommand git pack-refs --all --prune <pack-refs.txt > ' > > +test_expect_success 'prune-remote-refs task not enabled by default' ' > + git clone . prune-test && > + ( > + cd prune-test && > + GIT_TRACE2_EVENT="$(pwd)/prune.txt" git maintenance run 2>err && > + test_subcommand ! git remote prune origin <prune.txt > + ) > +' OK. > +test_expect_success 'prune-remote-refs task cleans stale remote refs' ' > + test_commit initial && > + > + # Create two separate remote repos > + git clone . remote1 && > + git clone . remote2 && > + > + git clone . prune-test-clean && > + ( > + cd prune-test-clean && > + git config maintenance.prune-remote-refs.enabled true && > + > + # Add both remotes > + git remote add remote1 "../remote1" && > + git remote add remote2 "../remote2" && > + > + # Create and push branches to both remotes > + git branch -f side2 HEAD && > + git push remote1 side2 && > + git push remote2 side2 && > + > + # Rename branches in each remote to simulate a stale branch > + git -C ../remote1 branch -m side2 side3 && > + git -C ../remote2 branch -m side2 side4 && > + > + GIT_TRACE2_EVENT="$(pwd)/prune.txt" git maintenance run > --task=prune-remote-refs && > + > + # Verify pruning happened for both remotes > + test_subcommand git remote prune remote1 <prune.txt && > + test_subcommand git remote prune remote2 <prune.txt && > + test_must_fail git rev-parse refs/remotes/remote1/side2 && > + test_must_fail git rev-parse refs/remotes/remote2/side2 > + ) > +' OK. Other than that it is not clear if we want it, the change looks cleanly done and was a pleasant read (other than whitespace damages on the patch text). Thanks. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Consider adding pruning of refs to git maintenance 2024-12-17 19:56 ` Junio C Hamano @ 2024-12-18 8:30 ` Shubham Kanodia 2024-12-18 15:35 ` Junio C Hamano 0 siblings, 1 reply; 11+ messages in thread From: Shubham Kanodia @ 2024-12-18 8:30 UTC (permalink / raw) To: Junio C Hamano; +Cc: Patrick Steinhardt, git, Derrick Stolee On Wed, Dec 18, 2024 at 1:26 AM Junio C Hamano <gitster@pobox.com> wrote: > > Shubham Kanodia <shubham.kanodia10@gmail.com> writes: > > > On Tue, Dec 17, 2024 at 4:51 PM Shubham Kanodia > > <shubham.kanodia10@gmail.com> wrote: > >> > >> On Tue, Dec 17, 2024 at 1:11 PM Junio C Hamano <gitster@pobox.com> wrote: > >> > > >> > Patrick Steinhardt <ps@pks.im> writes: > >> > > >> > > If we want to have such a feature I'd thus claim that it would be most > >> > > sensible to make it opt-in rather than opt-out. I wouldn't want to be > >> > > surprised by remote refs vanishing after going to bed, but may be okay > >> > > with it when I explicitly ask for it. > >> > > > >> > > At that point one has to raise the question whether it is still all that > >> > > useful compared to running `git remote prune` manually every now and > >> > > then. Mostly because explicitly configuring maintenance is probably > >> > > something that only power users would do, and those power users would > >> > > likely know to prune manually. > >> > > >> > I am 100% in agreement with your reasoning. If this thing is to > >> > exist, it has to be opt-in. We also need to add ample warning why > >> > doing this asynchronously behind user's back while the user could be > >> > working in the same repository is prone to cause surprises in the > >> > documentation in big red letters. > >> > > >> > I however am OK with the idea of having it as an optional "task" > >> > that is disabled by default in "git maintenance". "git maintenance" > >> > can be viewed as a platform-neutral way to set up scheduled tasks. > >> > > >> > A user may be a Git expert who knows what `git remote prune` does, > >> > and understands and accepts the risk of confusion doing so without > >> > explicit "do it now" from the end-user, but may not be a Linux or a > >> > macOS or a Windows expert to know how to write crontab or its > >> > equivalents on various schemes to define scheduled tasks. > >> > > >> > Thanks. > >> > >> Thanks for sharing your thoughts here. For some context — I currently > >> look after git performance for a very large repository (60k+ refs) > >> with a lot of active developers. > >> I've observed that while power git users keep their repository tidy > >> even without runnining maintenance, the majority of users infact don't > >> know (or execute these), and > >> run into performance issues. Easier fleet management was perhaps was > >> one of the motivations behind adding this to `git-maintenance`. > >> > >> While its pretty rare that someone relies on a `refs/remote/*` > >> reference (without creating a local copy of it in `refs/heads`), I can > >> see why it can be surprising and an opt-in > >> should be okay for me. > >> > >> Thanks, > >> Shubham K > > If you want to mix a patch with other discussion, write a scissors > line (e.g. "--- >8 --- cut here --- >8 ---" and nothing else on its > own line) here before the patch begins, and tell readers that they > need to scroll 50+ lines before they can see the patch at the > beginning of your message if the leading part needs to be so long. > > > From aed935055aff819cce591eecbf6697f8e8e7b99e Mon Sep 17 00:00:00 2001 > > Never include this line in the body of the message. > > > From: Shubham Kanodia <shubham.kanodia10@gmail.com> > > Having "From:" in the body of the message at the beginning of your > patch is a good way to override the author identity and set it > differently from what your MUA places on the From: header line, > when you are forwarding somebody else's patch and want to record the > author identity of the real author, but then this line would not > have your name-email. Another use case is when your MUA cannot be > convinced to place your author identity on its From: header, but the > message I am responding to has the same name-address in the From: > header. > > So in conclusion, the above line is unnecessary in the body of this > message. > > > Date: Tue, 17 Dec 2024 16:05:16 +0530 > > Rarely it is a good idea to include this line in the body of the > message. The Date: header records the time the general public saw > the patch for the first time, which is the author date as far as the > general public is concered, much better. > > > Subject: [PATCH] maintenance: add prune-remote-refs task > > Overriding "Subject:" header of the e-mail with this in-body header > is almost always needed when you are responding to an ongoing > discussion and including a patch after a scissors line. > > So in short, "From/From:/Date:" should be dropped, "Subject:" kept, > and insert a scissors line before it. > > > Remote-tracking refs can accumulate in local repositories even as branches > > are deleted on remotes, impacting git performance. > > OK. It usually is better to be explicit about the direction of the > "impact". E.g., "impacting git performance negatively". > > Have you considered using the reftable backend, instead of the files > backend, by the way? It is rumored to scale much better with the > number of refs. > > It is possible that the user deliberately kept stale remote-tracking > branches even after they are removed from the origin because they > used them as the bases of their topic branches, in which case it > would not be a good idea to blindly remove them, but I do not think > that needs to be said here. These users would not be running > "git remote prune". > > > While git fetch --prune > > helps clean these up, it requires a full fetch operation which is expensive > > for large repositories. > > It is curious why "git fetch --prune" is mentioned here, especially > when you know "git remote prune" is much more appropriate operation > for that purpose. The real justification this paragraph needed to > make is why it is a good idea to a new task in "git maintenance", > rather than letting the user run "git remote prune" themselves, > isn't it? > > > Add a new maintenance task 'prune-remote-refs' that runs 'git remote prune' > > for each configured remote. This provides an automated way to clean up stale > > remote-tracking refs without requiring full fetches or manual intervention. > > Again, "requiring full fetches or" is probably out of place. > > > This task is disabled by default. > > If we wanted to mention "fetch --prune", it is probably not as an > inferiour alternative to what you invented here, but as a different > way the user can work that makes the problem you are trying to solve > go away. Perhaps near the beginning, after the "Remote-tracking > refs can accumulate and impact performance negatively", say > something like: > > ... negatively. The user can have the "fetch.prune" > configuration variable set to true when interacting with the > other side, and stale remote-tracking refs will automatically > pruned at zero cost. > > Of course "remote.<name>.prune" is a more targetted way that works > per remote, but your prune-remote-refs works on all remotes, > fetch.prune is a better contrast. > > In any case, stepping back a bit, for the population of user who > benefit from enabing the prune-remote-refs task, wouldn't it be an > even better solution for them to set fetch.prune? You can tell them > to run "git remote prune" just once, set that configuration > variable, and then the remote-tracking branches will stay clean from > then on. Any future interactions with the remote make sure stale > remote-tracking branches will be removed automatically. Wouldn't > that be a much better option? I am sure I must be missing a use > case where fetch.prune (or remote.<name>.prune) is not a good idea > but background prune-remote-refs task works better. > > So, if we want to still add this new maintenance task, we should > explain *that* use case after the proposed addition of "you can do > with fetch.prune at zero cost" to the first paragraph. IOW, before > starting "Add a new maintenance task ...", we need to say something > like: > > There however are cases where pruning stale remote-tracking refs > at the "fetch" time is undesirable, yet keeping them indefinitely > is also unacceptable. <<HERE YOU DESCRIBE THE USE CASE WITH > SUCH CHARACTERISTICS>>. To help such use cases, add a new > maintenance task ... > > And if there is no such workflow where a background prune-remote-refs > task is needed and fetch.prune would not work well, then we do not > need a new task, do we? > > Let me review the change (assuming that there are cases where > fetch.prune is not a good solution) below. > > > +prune-remote-refs:: > > + The `prune-remote-refs` task runs `git remote prune` on each remote > > + repository registered in the local repository. This task helps clean > > + up deleted remote branches, improving the performance of operations > > + that iterate through the refs. See linkgit:git-remote[1] for more > > + information. This task is disabled by default and must be enabled > > + manually. > > ++ > > +NOTE: This task is opt-in to prevent unexpected removal of remote refs > > +for users of git-maintenance. It is recommended for power users who > > +are comfortable with configuring maintenance tasks and understand the > > +implications of automatic pruning. For others, running `git remote prune` > > +manually might be more appropriate. > > "and must be enabled" is unnecessary---that is what "disabled by > default" means. Also, we probably would want to add a sentence or > two to contrast this with setting fetch.prune to highlight why users > would want to consider enabling this (instead of setting fetch.prune). > > Otherwise, nicely written. > > > OPTIONS > > ------- > > --auto:: > > diff --git a/builtin/gc.c b/builtin/gc.c > > index 4ae5196aed..9acf1d2989 100644 > > --- a/builtin/gc.c > > +++ b/builtin/gc.c > > @@ -20,6 +20,7 @@ > > #include "lockfile.h" > > #include "parse-options.h" > > #include "run-command.h" > > +#include "remote.h" > > #include "sigchain.h" > > #include "strvec.h" > > #include "commit.h" > > @@ -913,6 +914,40 @@ static int maintenance_opt_schedule(const struct > > option *opt, const char *arg, > > return 0; > > } > > > > +static int collect_remote(struct remote *remote, void *cb_data) > > +{ > > + struct string_list *list = cb_data; > > + > > + if (!remote->url.nr) > > + return 0; > > + > > + string_list_append(list, remote->name); > > + return 0; > > +} > > + > > +static int maintenance_task_prune_remote(struct maintenance_run_opts > > *opts UNUSED, > > + struct gc_config *cfg UNUSED) > > +{ > > + struct string_list_item *item; > > + struct string_list remotes_list = STRING_LIST_INIT_NODUP; > > + struct child_process child = CHILD_PROCESS_INIT; > > + int result = 0; > > + > > + for_each_remote(collect_remote, &remotes_list); > > + > > + for_each_string_list_item (item, &remotes_list) { > > + const char *remote_name = item->string; > > Is the patch whitespace-damaged? The indentation seems to be gone. > I won't comment on misindented lines and blocks in this review (i.e. > even if the original had such problems, I wouldn't be seeing them in > the message I am responding to). > > > + child.git_cmd = 1; > > + strvec_pushl(&child.args, "remote", "prune", remote_name, NULL); > > + > > + if (run_command(&child)) > > + result = error(_("failed to prune '%s'"), remote_name); > > + } > > + > > + string_list_clear(&remotes_list, 0); > > + return result; > > +} > > OK. > > > /* Remember to update object flag allocation in object.h */ > > #define SEEN (1u<<0) > > > > @@ -1375,6 +1410,7 @@ enum maintenance_task_label { > > TASK_GC, > > TASK_COMMIT_GRAPH, > > TASK_PACK_REFS, > > + TASK_PRUNE_REMOTE_REFS, > > > > /* Leave as final value */ > > TASK__COUNT > > @@ -1411,6 +1447,10 @@ static struct maintenance_task tasks[] = { > > maintenance_task_pack_refs, > > pack_refs_condition, > > }, > > + [TASK_PRUNE_REMOTE_REFS] = { > > + "prune-remote-refs", > > + maintenance_task_prune_remote, > > + }, > > }; > > > > static int compare_tasks_by_selection(const void *a_, const void *b_) > > @@ -1505,6 +1545,8 @@ static void initialize_maintenance_strategy(void) > > tasks[TASK_LOOSE_OBJECTS].schedule = SCHEDULE_DAILY; > > tasks[TASK_PACK_REFS].enabled = 1; > > tasks[TASK_PACK_REFS].schedule = SCHEDULE_WEEKLY; > > + tasks[TASK_PRUNE_REMOTE_REFS].enabled = 0; > > + tasks[TASK_PRUNE_REMOTE_REFS].schedule = SCHEDULE_DAILY; > > } > > } > > OK. > > > diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh > > index 0ce4ba1cbe..60a0c3f835 100755 > > --- a/t/t7900-maintenance.sh > > +++ b/t/t7900-maintenance.sh > > @@ -446,6 +446,50 @@ test_expect_success 'pack-refs task' ' > > test_subcommand git pack-refs --all --prune <pack-refs.txt > > ' > > > > +test_expect_success 'prune-remote-refs task not enabled by default' ' > > + git clone . prune-test && > > + ( > > + cd prune-test && > > + GIT_TRACE2_EVENT="$(pwd)/prune.txt" git maintenance run 2>err && > > + test_subcommand ! git remote prune origin <prune.txt > > + ) > > +' > > OK. > > > +test_expect_success 'prune-remote-refs task cleans stale remote refs' ' > > + test_commit initial && > > + > > + # Create two separate remote repos > > + git clone . remote1 && > > + git clone . remote2 && > > + > > + git clone . prune-test-clean && > > + ( > > + cd prune-test-clean && > > + git config maintenance.prune-remote-refs.enabled true && > > + > > + # Add both remotes > > + git remote add remote1 "../remote1" && > > + git remote add remote2 "../remote2" && > > + > > + # Create and push branches to both remotes > > + git branch -f side2 HEAD && > > + git push remote1 side2 && > > + git push remote2 side2 && > > + > > + # Rename branches in each remote to simulate a stale branch > > + git -C ../remote1 branch -m side2 side3 && > > + git -C ../remote2 branch -m side2 side4 && > > + > > + GIT_TRACE2_EVENT="$(pwd)/prune.txt" git maintenance run > > --task=prune-remote-refs && > > + > > + # Verify pruning happened for both remotes > > + test_subcommand git remote prune remote1 <prune.txt && > > + test_subcommand git remote prune remote2 <prune.txt && > > + test_must_fail git rev-parse refs/remotes/remote1/side2 && > > + test_must_fail git rev-parse refs/remotes/remote2/side2 > > + ) > > +' > > OK. > > Other than that it is not clear if we want it, the change looks > cleanly done and was a pleasant read (other than whitespace damages > on the patch text). > > Thanks. Let me address a few points here before I resend the patch in a more ideally formatted manner. > Have you considered using the reftable backend, instead of the files > backend, by the way? It is rumored to scale much better with the > number of refs. I don't think we've done that yet, but thanks for the pointer. In any case, till reftable is stable (including support for reflogs etc), we'd want to avoid switching all clients to use that, > In any case, stepping back a bit, for the population of user who > benefit from enabing the prune-remote-refs task, wouldn't it be an > even better solution for them to set fetch.prune? You can tell them > to run "git remote prune" just once, set that configuration > variable, and then the remote-tracking branches will stay clean from > then on. Any future interactions with the remote make sure stale > remote-tracking branches will be removed automatically. Wouldn't > that be a much better option? I am sure I must be missing a use > case where fetch.prune (or remote.<name>.prune) is not a good idea > but background prune-remote-refs task works better. Let me expand on the context for suggesting this change: I work with a large repository that has over 50k refs, with about 4k new ones added weekly. We have maintenance scripts on our git server that clean up stale refs (unused older than N months). Using `fetch.prune` with a normal git fetch isn't ideal because it would cause git fetch to unnecessarily download many new refs that users don't need. So we actively discourage that. In theory, users could just run `git remote prune` once and carefully avoid full fetches to keep their local ref count low. However, in practice, we've found that full fetches happen through various indirect means: - Shell plugins like zsh/pure - Git GUIs like Sourcetree - Code editors like VSCode among others. These tools sometimes perform background fetches without users realizing it, causing ref counts to gradually increase until users remember to run `git remote prune` again. Typically, most users care about: - Their local branches (refs/heads) - usually a few hundred in number - A small subset of remote branches they might work with from time to time (>99% of active remote refs are irrelevant to any individual developer in a large repo) The goal here is to help users who can already run `git remote prune` to do so periodically. - If full `git fetch` is completely avoided, this will gradually reduce the local ref count from tens of thousands to just a few hundred active refs (even if the remote has 50k+ active refs) as old branches on the remote expire with time. - Even if not —say, if an errant tool or the developer executes `git fetch` mistakenly, then the maintenance job ensures this doesn't become their permanent state until the next manual remote prune. Ultimately, this is about providing a better default for large-scale repositories (& users who choose to keep refs pruned). ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Consider adding pruning of refs to git maintenance 2024-12-18 8:30 ` Shubham Kanodia @ 2024-12-18 15:35 ` Junio C Hamano 2024-12-22 12:19 ` Shubham Kanodia 0 siblings, 1 reply; 11+ messages in thread From: Junio C Hamano @ 2024-12-18 15:35 UTC (permalink / raw) To: Shubham Kanodia; +Cc: Patrick Steinhardt, git, Derrick Stolee Shubham Kanodia <shubham.kanodia10@gmail.com> writes: >> ... >> Thanks. [administrivia: respond inline, trim out parts that do not have to be read by bystanders to understand your response]. >> In any case, stepping back a bit, for the population of user who >> benefit from enabing the prune-remote-refs task, wouldn't it be an >> even better solution for them to set fetch.prune? You can tell them >> to run "git remote prune" just once, set that configuration >> variable, and then the remote-tracking branches will stay clean from >> then on. Any future interactions with the remote make sure stale >> remote-tracking branches will be removed automatically. Wouldn't >> that be a much better option? I am sure I must be missing a use >> case where fetch.prune (or remote.<name>.prune) is not a good idea >> but background prune-remote-refs task works better. > > Let me expand on the context for suggesting this change: > > I work with a large repository that has over 50k refs, with about 4k > new ones added weekly. > We have maintenance scripts on our git server that clean up stale refs > (unused older than N months). > Using `fetch.prune` with a normal git fetch isn't ideal because it > would cause git fetch to unnecessarily download many new refs that > users don't need. So we actively discourage that. This is what I did not quite understand. What do your users normally do to update their repository from the remote to become in sync, if they are not running "git fetch"? Side note: it is very likely that your users are not directly be running "git fetch", but using various front-ends like "git pull", "git pull --rebase", or even "repo", but they all at some point call "git fetch" to get the new objects and update refs. Ah, are they using "git fetch origin +foo:refs/remotes/origin/foo", i.e., only selectively fetch the thing that they use and nothing else (again, their wrappers may supply the refspec to do the limiting)? Now it slowly starts to make sense to me (sorry, I am slow, especially without caffeine in the morning). Am I following / guessing your set-up more or less correctly so far? In any case, if your users are doing selective fetching, 50k refs or 4k ref turnover per week on the other side does not really matter. Your users' desktop repositories won't see remote-tracking refs that they didn't use and ask for. But you are right that these selectively fetched refs will accumulate unless pruned, and fetch.prune would not prune anything when git fetch origin +foo:refs/remotes/origin/foo because it will not prune what is outside the hierarchy the refspec covers and this is a deliberate design decision. For "git fetch origin '+refs/heads/*:refs/remotes/origin/*'", which is pretty much how "git clone" sets up the remotes, anything we have in refs/remotes/origin/ hierarchy that do not appear in the current refs/heads/ hiearchy they have are pruned with fetch.prune=true. But if you fetch selectively, either 'foo' exists (in which case it won't be pruned), or 'foo' went away (in which case the fetch itself fails before even pruning what is on our end), so fetch.prune may not help. And at least for a shorter term, periodically running "remote prune" would be an acceptable workaround for such a workflow. In the longer term, I suspect that we may want a new option that lets you more aggressively prune your remote-tracking refs, telling the tool something like git fetch --prune-aggressive origin +refs/heads/foo:refs/remotes/origin/foo to mean "I only am interested in getting the object to complete their current 'foo' branch, and get my remote-tracking ref for that branch updated, BUT if you notice some ref in my refs/remotes/origin/* that they do not have in their refs/heads/*, please prune it, even when they are not 'foo' (which means normal --prune would not prune them)", would not be a terrible idea. It would be more involved than running "remote prune" periodically, of course. > In theory, users could just run `git remote prune` once and carefully > avoid full fetches to keep their local ref count low. > However, in practice, we've found that full fetches happen through > various indirect means: > > - Shell plugins like zsh/pure > - Git GUIs like Sourcetree > - Code editors like VSCode > > among others. And do any of these bypass underlying "git fetch"? If not, then one easier solution is to accept that somebody will do the regular refs/heads/*:refs/remotes/origin/* full-fetch *anyway*. Once we accept that to happen, we can tell "git fetch" to always prune. Then when these "various indirect means" attempt their full fetch, "git fetch" invoked by them would still honor fetch.prune, so even though they would try to maintain these 50k refs in-sync with the remote, which means you may see 4k new refs per week, but plausibly the refs that got retired are seen as stale and get removed on the users' repositories. > - If full `git fetch` is completely avoided, this will gradually > reduce the local ref count from tens of thousands to just a few > hundred active refs (even if the remote has 50k+ active refs) as old > branches on the remote expire with time. Yes. You'd somehow need to arrange these third-party tools not to fetch too much unneeded cruft. > - Even if not —say, if an errant tool or the developer executes `git > fetch` mistakenly, then the maintenance job ensures this doesn't > become their permanent state until the next manual remote prune. For the latter case, fetch.prune=true would be the ideal solution, I would imagine. The "errant tool"'s 'git fetch' would prune the stale ones. Now, the documentation should explain when this "periodically running remote prune" is an acceptable workaround and/or a useful solution, relative to setting fetch.prune, as most parts of the existing documentation do assume that the users, intended audience of the document, are using the bog-standard "git clone" result, that copies all their branches to remote-tracking branches. Thanks. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Consider adding pruning of refs to git maintenance 2024-12-18 15:35 ` Junio C Hamano @ 2024-12-22 12:19 ` Shubham Kanodia 2024-12-23 4:21 ` Junio C Hamano 0 siblings, 1 reply; 11+ messages in thread From: Shubham Kanodia @ 2024-12-22 12:19 UTC (permalink / raw) To: Junio C Hamano; +Cc: Patrick Steinhardt, git, Derrick Stolee On Wed, Dec 18, 2024 at 9:05 PM Junio C Hamano <gitster@pobox.com> wrote: ... > Ah, are they using "git fetch origin +foo:refs/remotes/origin/foo", > i.e., only selectively fetch the thing that they use and nothing > else (again, their wrappers may supply the refspec to do the > limiting)? Now it slowly starts to make sense to me (sorry, I am > slow, especially without caffeine in the morning). > > Am I following / guessing your set-up more or less correctly so far? Yes. The root cause of the issue here is that the behaviour that `git fetch` / `git pull` fetches all refs by default is undesirable in large git repositories. It's almost never what you want to do. We advice users to execute `git fetch <remote> <ref>` when they want to run fetch (/pull) explicitly. Ideally, `git fetch` would only fetch the current branch when an explicit branch is not specified, and `git fetch --all` would pull in all. Now I believe git does provide a way to configure the default fetch refs via `remote.<name>.fetch`. So in theory I could just set — ``` [remote "origin"] fetch = +refs/heads/master:refs/remotes/origin/master ``` which would avoid someone pulling in all refs accidentally. However that has a side effect that now if you do want to fetch & start working on a remote ref that you weren't previously tracking, a command like ``` git fetch origin new-ref-branch-from-remote ``` no longer allows you to just start working on this new branch by doing a `git checkout new-ref-branch-from-remote`. If you wanted to be able to do that, you'd probably need to do — ``` git fetch origin new-ref-branch-from-remote --refmap=+refs/heads/new-ref-branch-from-remote:refs/remotes/origin/new-ref-branch-from-remote ``` which is pretty awkward to type everytime. Now to come back from this little digression, for now — - We ask users to set both `fetch.prune` and `fetch.pruneTags` to true (so that if a third party tool does do something, the damage is limited and they don't have an ever-growing list of refs) - Setup this job that cleans stale remote refs on a periodic basis, which means that their ref counts heal over time (if they configure all third party tools right) > Now, the documentation should explain when this "periodically running > remote prune" is an acceptable workaround and/or a useful solution, > relative to setting fetch.prune, as most parts of the existing > documentation do assume that the users, intended audience of the > document, are using the bog-standard "git clone" result, that copies > all their branches to remote-tracking branches. Agreed, will update docs to include that. Thanks, Shubham K ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Consider adding pruning of refs to git maintenance 2024-12-22 12:19 ` Shubham Kanodia @ 2024-12-23 4:21 ` Junio C Hamano 2024-12-23 9:30 ` Shubham Kanodia 0 siblings, 1 reply; 11+ messages in thread From: Junio C Hamano @ 2024-12-23 4:21 UTC (permalink / raw) To: Shubham Kanodia; +Cc: Patrick Steinhardt, git, Derrick Stolee Shubham Kanodia <shubham.kanodia10@gmail.com> writes: > If you wanted to be able to do that, you'd probably need to do — > > ``` > git fetch origin new-ref-branch-from-remote > --refmap=+refs/heads/new-ref-branch-from-remote:refs/remotes/origin/new-ref-branch-from-remote > ``` > > which is pretty awkward to type everytime. Or have this in your .git/config [remote "origin"] fetch = +refs/heads/new-ref-branch-from-remote:refs/remotes/origin/new-ref-branch-from-remote and run $ git fetch origin new-ref-branch-from-remote If you plan to do the same for more random branches, you could instead do [remote "origin"] fetch = +refs/heads/*:refs/remotes/origin/* as long as you promise that you never run "git fetch" without saying which exact refs to fetch. I wonder if it helps to introduce a new configuration remote.*.fetchmap [remote "origin"] fetchmap = +refs/heads/*:refs/remotes/origin/* that only talks about how the mapping goes without saying what gets fetched by default, so that the lack of remote.origin.fetch would cause $ git fetch origin $ git fetch not to grab everything (like when you had the same specification as the value of remote.origin.fetch instead), but allows you to map whatever you grab from there when you did $ git fetch origin foo bar If the glob pattern given by the fetchmap participated in how fetch.prune behaves (i.e. I see refs/remotes/origin/baz here, but while trying to fetch refs/heads/{foo,bar}, I notice that they no longer advertise refs/heads/baz---let me prune that stale one), that would make it much simpler. > Now to come back from this little digression, for now — > - We ask users to set both `fetch.prune` and `fetch.pruneTags` to true > (so that if a third party tool does do something, the damage is > limited and they don't have an ever-growing list of refs) Good. > - Setup this job that cleans stale remote refs on a periodic basis, > which means that their ref counts heal over time (if they configure > all third party tools right) OK. And a scheduled task would make this part better. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Consider adding pruning of refs to git maintenance 2024-12-23 4:21 ` Junio C Hamano @ 2024-12-23 9:30 ` Shubham Kanodia 0 siblings, 0 replies; 11+ messages in thread From: Shubham Kanodia @ 2024-12-23 9:30 UTC (permalink / raw) To: Junio C Hamano; +Cc: Patrick Steinhardt, git, Derrick Stolee On Mon, Dec 23, 2024 at 9:51 AM Junio C Hamano <gitster@pobox.com> wrote: > as long as you promise that you never run "git fetch" without saying > which exact refs to fetch. I wonder if it helps to introduce a new > configuration remote.*.fetchmap > > [remote "origin"] > fetchmap = +refs/heads/*:refs/remotes/origin/* > > that only talks about how the mapping goes without saying what gets > fetched by default, so that the lack of remote.origin.fetch would > cause > > $ git fetch origin > $ git fetch > > not to grab everything (like when you had the same specification as > the value of remote.origin.fetch instead), but allows you to map > whatever you grab from there when you did This would be perfect really. Adding a new refmap every time you need to pull in changes from someone else's branch is too cumbersome. I'd be happy to attempt this as a follow up patch where `remote.<remote>.fetchmap` if you think that's possible to add. Also to keep things cleaner — I'm going to re-submit this patch using gitgitgadget separately. Thanks! Shubham K ^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2024-12-23 9:31 UTC | newest] Thread overview: 11+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-12-16 12:33 Consider adding pruning of refs to git maintenance Shubham Kanodia 2024-12-17 7:21 ` Patrick Steinhardt 2024-12-17 7:41 ` Junio C Hamano 2024-12-17 11:21 ` Shubham Kanodia 2024-12-17 11:24 ` Shubham Kanodia 2024-12-17 19:56 ` Junio C Hamano 2024-12-18 8:30 ` Shubham Kanodia 2024-12-18 15:35 ` Junio C Hamano 2024-12-22 12:19 ` Shubham Kanodia 2024-12-23 4:21 ` Junio C Hamano 2024-12-23 9:30 ` Shubham Kanodia
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).