From: Phillip Wood <phillip.wood123@gmail.com>
To: Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com>,
git@vger.kernel.org
Cc: peff@peff.net, jrnieder@google.com, stolee@gmail.com,
Derrick Stolee <dstolee@microsoft.com>
Subject: Re: [PATCH 03/15] run-job: implement fetch job
Date: Sun, 5 Apr 2020 16:14:38 +0100 [thread overview]
Message-ID: <0e924507-e77e-bff9-196a-e73f296a99d9@gmail.com> (raw)
In-Reply-To: <77b1da5d3063a2404cd750adfe3bb8be9b6c497d.1585946894.git.gitgitgadget@gmail.com>
Hi Stolee
On 03/04/2020 21:48, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
>
> When working with very large repositories, an incremental 'git fetch'
> command can download a large amount of data. If there are many other
> users pushing to a common repo, then this data can rival the initial
> pack-file size of a 'git clone' of a medium-size repo.
>
> Users may want to keep the data on their local repos as close as
> possible to the data on the remote repos by fetching periodically in
> the background. This can break up a large daily fetch into several
> smaller hourly fetches.
>
> However, if we simply ran 'git fetch <remote>' in the background,
> then the user running a foregroudn 'git fetch <remote>' would lose
> some important feedback when a new branch appears or an existing
> branch updates. This is especially true if a remote branch is
> force-updated and this isn't noticed by the user because it occurred
> in the background. Further, the functionality of 'git push
> --force-with-lease' becomes suspect.
>
> When running 'git fetch <remote> <options>' in the background, use
> the following options for careful updating:
>
> 1. --no-tags prevents getting a new tag when a user wants to see
> the new tags appear in their foreground fetches.
>
> 2. --refmap= removes the configured refspec which usually updates
> refs/remotes/<remote>/* with the refs advertised by the remote.
>
> 3. By adding a new refspec "+refs/heads/*:refs/hidden/<remote>/*"
> we can ensure that we actually load the new values somewhere in
> our refspace while not updating refs/heads or refs/remotes. By
> storing these refs here, the commit-graph job will update the
> commit-graph with the commits from these hidden refs.
>
> 4. --prune will delete the refs/hidden/<remote> refs that no
> longer appear on the remote.
>
> We've been using this step as a critical background job in Scalar
> [1] (and VFS for Git). This solved a pain point that was showing up
> in user reports: fetching was a pain! Users do not like waiting to
> download the data that was created while they were away from their
> machines. After implementing background fetch, the foreground fetch
> commands sped up significantly because they mostly just update refs
> and download a small amount of new data. The effect is especially
> dramatic when paried with --no-show-forced-udpates (through
> fetch.showForcedUpdates=false).
>
> [1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/FetchStep.cs
>
> RFC QUESTIONS:
>
> 1. One downside of the refs/hidden pattern is that 'git log' will
> decorate commits with twice as many refs if they appear at a
> remote ref (<remote>/<ref> _and_ refs/hidden/<remote>/<ref>). Is
> there an easy way to exclude a refspace from decorations? Should
> we make refs/hidden/* a "special" refspace that is excluded from
> decorations?
Having some way to specify which refs outside of
refs/{heads,remote,tags}/ to show or exclude from decorations would be
useful I think. Fetching to a hidden ref is a good idea (as are the
other steps you outline above) but as you say we don't want it to show
up in the output of 'git log' etc.
> 2. This feature is designed for a desktop machine or equivalent
> that has a permanent wired network connection, and the machine
> stays on while the user is not present. For things like laptops
> with inconsistent WiFi connections (that may be metered) the
> feature can use the less stable connection more than the user
> wants. Of course, this feature is opt-in for Git, but in Scalar
> we have a "scalar pause" command [2] that pauses all maintenance
> for some amount of time. We should consider a similar mechanism
> for Git, but for the point of this series the user needs to set
> up the "background" part of these jobs manually.
>
> [2] https://github.com/microsoft/scalar/blob/master/Scalar/CommandLine/PauseVerb.cs
> [3] https://github.com/microsoft/scalar/blob/master/docs/advanced.md#controlling-background-maintenance
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
> Documentation/git-run-job.txt | 13 ++++-
> builtin/run-job.c | 89 ++++++++++++++++++++++++++++++++++-
> t/t7900-run-job.sh | 22 +++++++++
> 3 files changed, 122 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/git-run-job.txt b/Documentation/git-run-job.txt
> index 8bf2762d650..eb92e891915 100644
> --- a/Documentation/git-run-job.txt
> +++ b/Documentation/git-run-job.txt
> @@ -9,7 +9,7 @@ git-run-job - Run a maintenance job. Intended for background operation.
> SYNOPSIS
> --------
> [verse]
> -'git run-job commit-graph'
> +'git run-job (commit-graph|fetch)'
>
>
> DESCRIPTION
> @@ -47,6 +47,17 @@ since it will not expire `.graph` files that were in the previous
> `commit-graph-chain` file. They will be deleted by a later run based on
> the expiration delay.
>
> +'fetch'::
> +
> +The `fetch` job updates the object directory with the latest objects
> +from all registered remotes. For each remote, a `git fetch` command is
> +run. The refmap is custom to avoid updating local or remote branches
> +(those in `refs/heads` or `refs/remotes`). Instead, the remote refs are
> +stored in `refs/hidden/<remote>/`. Also, no tags are updated.
> ++
> +This means that foreground fetches are still required to update the
> +remote refs, but the users is notified when the branches and tags are
> +updated on the remote.
>
> GIT
> ---
> diff --git a/builtin/run-job.c b/builtin/run-job.c
> index dd7709952d3..e59056b2918 100644
> --- a/builtin/run-job.c
> +++ b/builtin/run-job.c
> @@ -7,7 +7,7 @@
> #include "run-command.h"
>
> static char const * const builtin_run_job_usage[] = {
> - N_("git run-job commit-graph"),
> + N_("git run-job (commit-graph|fetch)"),
> NULL
> };
>
> @@ -60,6 +60,91 @@ static int run_commit_graph_job(void)
> return 1;
> }
>
> +static int fetch_remote(const char *remote)
> +{
> + int result;
> + struct argv_array cmd = ARGV_ARRAY_INIT;
> + struct strbuf refmap = STRBUF_INIT;
> +
> + argv_array_pushl(&cmd, "fetch", remote, "--quiet", "--prune",
> + "--no-tags", "--refmap=", NULL);
> +
> + strbuf_addf(&refmap, "+refs/heads/*:refs/hidden/%s/*", remote);
> + argv_array_push(&cmd, refmap.buf);
> +
> + result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +
> + strbuf_release(&refmap);
> + return result;
> +}
> +
> +static int fill_remotes(struct string_list *remotes)
Isn't there a easy way to get this using the config api rather than
forking 'git remote'?
Best Wishes
Phillip
> +{
> + int result = 0;
> + FILE *proc_out;
> + struct strbuf line = STRBUF_INIT;
> + struct child_process *remote_proc = xmalloc(sizeof(*remote_proc));
> +
> + child_process_init(remote_proc);
> +
> + argv_array_pushl(&remote_proc->args, "git", "remote", NULL);
> +
> + remote_proc->out = -1;
> +
> + if (start_command(remote_proc)) {
> + error(_("failed to start 'git remote' process"));
> + result = 1;
> + goto cleanup;
> + }
> +
> + proc_out = xfdopen(remote_proc->out, "r");
> +
> + /* if there is no line, leave the value as given */
> + while (!strbuf_getline(&line, proc_out))
> + string_list_append(remotes, line.buf);
> +
> + strbuf_release(&line);
> +
> + fclose(proc_out);
> +
> + if (finish_command(remote_proc)) {
> + error(_("failed to finish 'git remote' process"));
> + result = 1;
> + }
> +
> +cleanup:
> + free(remote_proc);
> + return result;
> +}
> +
> +static int run_fetch_job(void)
> +{
> + int result = 0;
> + struct string_list_item *item;
> + struct string_list remotes = STRING_LIST_INIT_DUP;
> +
> + if (fill_remotes(&remotes)) {
> + error(_("failed to fill remotes"));
> + result = 1;
> + goto cleanup;
> + }
> +
> + /*
> + * Do not modify the result based on the success of the 'fetch'
> + * operation, as a loss of network could cause 'fetch' to fail
> + * quickly. We do not want that to stop the rest of our
> + * background operations.
> + */
> + for (item = remotes.items;
> + item && item < remotes.items + remotes.nr;
> + item++)
> + fetch_remote(item->string);
> +
> +cleanup:
> + string_list_clear(&remotes, 0);
> + return result;
> +}
> +
> int cmd_run_job(int argc, const char **argv, const char *prefix)
> {
> static struct option builtin_run_job_options[] = {
> @@ -79,6 +164,8 @@ int cmd_run_job(int argc, const char **argv, const char *prefix)
> if (argc > 0) {
> if (!strcmp(argv[0], "commit-graph"))
> return run_commit_graph_job();
> + if (!strcmp(argv[0], "fetch"))
> + return run_fetch_job();
> }
>
> usage_with_options(builtin_run_job_usage,
> diff --git a/t/t7900-run-job.sh b/t/t7900-run-job.sh
> index 18b9bd26b3a..d3faeba135b 100755
> --- a/t/t7900-run-job.sh
> +++ b/t/t7900-run-job.sh
> @@ -44,4 +44,26 @@ test_expect_success 'commit-graph job' '
> )
> '
>
> +test_expect_success 'fetch job' '
> + git clone "file://$(pwd)/server" client &&
> +
> + # Before fetching, build a client commit-graph
> + git -C client run-job commit-graph &&
> + chain=client/.git/objects/info/commit-graphs/commit-graph-chain &&
> + test_line_count = 1 $chain &&
> +
> + git -C client branch -v --remotes >before-refs &&
> + test_commit -C server 24 &&
> +
> + git -C client run-job fetch &&
> + git -C client branch -v --remotes >after-refs &&
> + test_cmp before-refs after-refs &&
> + test_cmp server/.git/refs/heads/master \
> + client/.git/refs/hidden/origin/master &&
> +
> + # the hidden ref should trigger a new layer in the commit-graph
> + git -C client run-job commit-graph &&
> + test_line_count = 2 $chain
> +'
> +
> test_done
>
next prev parent reply other threads:[~2020-04-05 15:14 UTC|newest]
Thread overview: 57+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-04-03 20:47 [PATCH 00/15] [RFC] Maintenance jobs and job runner Derrick Stolee via GitGitGadget
2020-04-03 20:48 ` [PATCH 01/15] run-job: create barebones builtin Derrick Stolee via GitGitGadget
2020-04-05 15:10 ` Phillip Wood
2020-04-05 19:21 ` Junio C Hamano
2020-04-06 14:42 ` Derrick Stolee
2020-04-07 0:58 ` Danh Doan
2020-04-07 10:54 ` Derrick Stolee
2020-04-07 14:16 ` Danh Doan
2020-04-07 14:30 ` Johannes Schindelin
2020-04-03 20:48 ` [PATCH 02/15] run-job: implement commit-graph job Derrick Stolee via GitGitGadget
2020-05-20 19:08 ` Josh Steadmon
2020-04-03 20:48 ` [PATCH 03/15] run-job: implement fetch job Derrick Stolee via GitGitGadget
2020-04-05 15:14 ` Phillip Wood [this message]
2020-04-06 12:48 ` Derrick Stolee
2020-04-05 20:28 ` Junio C Hamano
2020-04-06 12:46 ` Derrick Stolee
2020-05-20 19:08 ` Josh Steadmon
2020-04-03 20:48 ` [PATCH 04/15] run-job: implement loose-objects job Derrick Stolee via GitGitGadget
2020-04-05 20:33 ` Junio C Hamano
2020-04-03 20:48 ` [PATCH 05/15] run-job: implement pack-files job Derrick Stolee via GitGitGadget
2020-05-27 22:17 ` Josh Steadmon
2020-04-03 20:48 ` [PATCH 06/15] run-job: auto-size or use custom pack-files batch Derrick Stolee via GitGitGadget
2020-04-03 20:48 ` [PATCH 07/15] config: add job.pack-files.batchSize option Derrick Stolee via GitGitGadget
2020-04-03 20:48 ` [PATCH 08/15] job-runner: create builtin for job loop Derrick Stolee via GitGitGadget
2020-04-03 20:48 ` [PATCH 09/15] job-runner: load repos from config by default Derrick Stolee via GitGitGadget
2020-04-05 15:18 ` Phillip Wood
2020-04-06 12:49 ` Derrick Stolee
2020-04-05 15:41 ` Phillip Wood
2020-04-06 12:57 ` Derrick Stolee
2020-04-03 20:48 ` [PATCH 10/15] job-runner: use config to limit job frequency Derrick Stolee via GitGitGadget
2020-04-05 15:24 ` Phillip Wood
2020-04-03 20:48 ` [PATCH 11/15] job-runner: use config for loop interval Derrick Stolee via GitGitGadget
2020-04-03 20:48 ` [PATCH 12/15] job-runner: add --interval=<span> option Derrick Stolee via GitGitGadget
2020-04-03 20:48 ` [PATCH 13/15] job-runner: skip a job if job.<job-name>.enabled is false Derrick Stolee via GitGitGadget
2020-04-03 20:48 ` [PATCH 14/15] job-runner: add --daemonize option Derrick Stolee via GitGitGadget
2020-04-03 20:48 ` [PATCH 15/15] runjob: customize the loose-objects batch size Derrick Stolee via GitGitGadget
2020-04-03 21:40 ` [PATCH 00/15] [RFC] Maintenance jobs and job runner Junio C Hamano
2020-04-04 0:16 ` Derrick Stolee
2020-04-07 0:50 ` Danh Doan
2020-04-07 10:59 ` Derrick Stolee
2020-04-07 14:26 ` Danh Doan
2020-04-07 14:43 ` Johannes Schindelin
2020-04-07 1:48 ` brian m. carlson
2020-04-07 20:08 ` Junio C Hamano
2020-04-07 22:23 ` Johannes Schindelin
2020-04-08 0:01 ` brian m. carlson
2020-05-27 22:39 ` Josh Steadmon
2020-05-28 0:47 ` Junio C Hamano
2020-05-27 21:52 ` Johannes Schindelin
2020-05-28 14:48 ` Junio C Hamano
2020-05-28 14:50 ` Jonathan Nieder
2020-05-28 14:57 ` Junio C Hamano
2020-05-28 15:03 ` Jonathan Nieder
2020-05-28 15:30 ` Derrick Stolee
2020-05-28 4:39 ` Johannes Schindelin
-- strict thread matches above, loose matches on Subject: below --
2020-04-13 13:15 [PATCH 03/15] run-job: implement fetch job Son Luong Ngoc
2020-04-13 14:14 ` Derrick Stolee
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=0e924507-e77e-bff9-196a-e73f296a99d9@gmail.com \
--to=phillip.wood123@gmail.com \
--cc=dstolee@microsoft.com \
--cc=git@vger.kernel.org \
--cc=gitgitgadget@gmail.com \
--cc=jrnieder@google.com \
--cc=peff@peff.net \
--cc=phillip.wood@dunelm.org.uk \
--cc=stolee@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.