git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Calvin Wan <calvinwan@google.com>
To: Patrick Steinhardt <ps@pks.im>
Cc: git@vger.kernel.org, steamdon@google.com, emrass@google.com,
	 me@ttaylorr.com, stolee@gmail.com
Subject: Re: [RFC PATCH 1/1] maintenance: separate parallelism safe and unsafe tasks
Date: Mon, 11 Nov 2024 10:06:10 -0800	[thread overview]
Message-ID: <CAFySSZCzxfqpMWH5ORv8fYb7f5WU3Fc2N99fW33wD9JOcYVrVA@mail.gmail.com> (raw)
In-Reply-To: <ZzGtD4Jz9Wj6n0zH@pks.im>

On Sun, Nov 10, 2024 at 11:07 PM Patrick Steinhardt <ps@pks.im> wrote:
>
> On Fri, Nov 08, 2024 at 05:31:12PM +0000, Calvin Wan wrote:
> > Certain maintenance tasks and subtasks within gc are unsafe to run in
> > parallel with other commands because they lock up files such as
> > HEAD.
>
> I don't think it is fair to classify this as "unsafe". Nothing is unsafe
> here: we take locks to guard us against concurrent modifications.
> What you're having problems with is the fact that this safety mechanism
> works as expected and keeps other processes from modifying locked the
> data.
>
> > Therefore, tasks are marked whether they are async safe or
> > not. Async unsafe tasks are run first in the same process before running
> > async safe tasks in parallel.
> >
> > Since the gc task is partially safe, there are two new tasks -- an async
> > safe gc task and an async unsafe gc task. In order to properly invoke
> > this in gc, `--run-async-safe` and `--run-async-unsafe` have been added
> > as options to gc. Maintenance will only run these two new tasks if it
> > was set to detach, otherwise the original gc task runs.
> >
> > Additionally, if a user passes in tasks thru `--task`, we do not attempt
> > to run separate async/sync tasks since the user sets the order of tasks.
> >
> > WIP: automatically run gc unsafe tasks when gc is invoked but not from
> >      maintenance
> > WIP: edit test in t7900-maintainance.sh to match new functionality
> > WIP: add additional documentation for new options and functionality
> >
> > Signed-off-by: Calvin Wan <calvinwan@google.com>
> > ---
> >  builtin/gc.c           | 173 ++++++++++++++++++++++++++++++++++++-----
> >  t/t7900-maintenance.sh |  24 +++---
> >  2 files changed, 168 insertions(+), 29 deletions(-)
> >
> > diff --git a/builtin/gc.c b/builtin/gc.c
> > index d52735354c..375d304c42 100644
> > --- a/builtin/gc.c
> > +++ b/builtin/gc.c
>
> It might make sense to split out the git-gc(1) changes into a
> preparatory commit with its own set of tests.
>
> > @@ -815,7 +824,12 @@ struct repository *repo UNUSED)
> >               atexit(process_log_file_at_exit);
> >       }
> >
> > -     gc_before_repack(&opts, &cfg);
> > +     if (run_async_unsafe) {
> > +             gc_before_repack(&opts, &cfg);
> > +             goto out;
> > +     } else if (!run_async_safe)
> > +             gc_before_repack(&opts, &cfg);
> > +
> >
>
> Style: there should be curly braces around the `else if` here.
>
> >       if (!repository_format_precious_objects) {
> >               struct child_process repack_cmd = CHILD_PROCESS_INIT;
> > @@ -1052,6 +1066,46 @@ static int maintenance_task_prefetch(struct maintenance_run_opts *opts,
> >       return 0;
> >  }
> >
> > +static int maintenance_task_unsafe_gc(struct maintenance_run_opts *opts,
> > +                                   struct gc_config *cfg UNUSED)
> > +{
> > +     struct child_process child = CHILD_PROCESS_INIT;
> > +
> > +     child.git_cmd = child.close_object_store = 1;
> > +     strvec_push(&child.args, "gc");
> > +
> > +     if (opts->auto_flag)
> > +             strvec_push(&child.args, "--auto");
> > +     if (opts->quiet)
> > +             strvec_push(&child.args, "--quiet");
> > +     else
> > +             strvec_push(&child.args, "--no-quiet");
> > +     strvec_push(&child.args, "--no-detach");
> > +     strvec_push(&child.args, "--run-async-unsafe");
> > +
> > +     return run_command(&child);
> > +}
> > +
> > +static int maintenance_task_safe_gc(struct maintenance_run_opts *opts,
> > +                                 struct gc_config *cfg UNUSED)
> > +{
> > +     struct child_process child = CHILD_PROCESS_INIT;
> > +
> > +     child.git_cmd = child.close_object_store = 1;
> > +     strvec_push(&child.args, "gc");
> > +
> > +     if (opts->auto_flag)
> > +             strvec_push(&child.args, "--auto");
> > +     if (opts->quiet)
> > +             strvec_push(&child.args, "--quiet");
> > +     else
> > +             strvec_push(&child.args, "--no-quiet");
> > +     strvec_push(&child.args, "--no-detach");
> > +     strvec_push(&child.args, "--run-async-safe");
> > +
> > +     return run_command(&child);
> > +}
>
> These two functions and `maintenance_task_gc()` all look exactly the
> same. We should deduplicate them.
>
> >  static int maintenance_task_gc(struct maintenance_run_opts *opts,
> >                              struct gc_config *cfg UNUSED)
> >  {
> > @@ -1350,6 +1404,7 @@ struct maintenance_task {
> >       const char *name;
> >       maintenance_task_fn *fn;
> >       maintenance_auto_fn *auto_condition;
> > +     unsigned daemonize_safe;
>
> We can use the enum here to give readers a better hint what this
> variable is about.
>
> >       unsigned enabled:1;
> >
> >       enum schedule_priority schedule;
> > @@ -1362,6 +1417,8 @@ enum maintenance_task_label {
> >       TASK_PREFETCH,
> >       TASK_LOOSE_OBJECTS,
> >       TASK_INCREMENTAL_REPACK,
> > +     TASK_UNSAFE_GC,
> > +     TASK_SAFE_GC,
> >       TASK_GC,
> >       TASK_COMMIT_GRAPH,
> >       TASK_PACK_REFS,
> > @@ -1370,36 +1427,62 @@ enum maintenance_task_label {
> >       TASK__COUNT
> >  };
> >
> > +enum maintenance_task_daemonize_safe {
> > +     UNSAFE,
> > +     SAFE,
> > +};
>
> These names can conflict quite fast. Do we maybe want to rename them to
> e.g. `MAINTENANCE_TASK_DAEMONIZE_(SAFE|UNSAFE)`?
>
> >  static struct maintenance_task tasks[] = {
> >       [TASK_PREFETCH] = {
> >               "prefetch",
> >               maintenance_task_prefetch,
> > +             NULL,
> > +             SAFE,
> >       },
>
> It might make sense to prepare these to take designated field
> initializers in a preparatory commit.

Thanks for all the stylistic feedback. I agree much of this can be
cleaned up to be simpler, but I sent this as an RFC to gather feedback
on whether this patch directionally made sense. Will clean everything
up in the v1.

>
> >       [TASK_LOOSE_OBJECTS] = {
> >               "loose-objects",
> >               maintenance_task_loose_objects,
> >               loose_object_auto_condition,
> > +             SAFE,
> >       },
> >       [TASK_INCREMENTAL_REPACK] = {
> >               "incremental-repack",
> >               maintenance_task_incremental_repack,
> >               incremental_repack_auto_condition,
> > +             SAFE,
> > +     },
> > +     [TASK_UNSAFE_GC] = {
> > +             "unsafe-gc",
> > +             maintenance_task_unsafe_gc,
> > +             need_to_gc,
> > +             UNSAFE,
> > +             0,
> > +     },
> > +     [TASK_SAFE_GC] = {
> > +             "safe-gc",
> > +             maintenance_task_safe_gc,
> > +             need_to_gc,
> > +             SAFE,
> > +             0,
> >       },
>
> Hm. I wonder whether we really want to expose additional tasks to
> address the issue, which feels like we're leaking implementation details
> to our users. Would it maybe be preferable to instead introduce a new
> optional callback function for every task that handles the pre-detach
> logic?

This does sound like a good idea. However, would there be any issue
with running all pre-detach logic before running post-detach logic?
I'm thinking if pre-detach logic from a different function could
affect post-detach logic from another. If not, I do agree this would
be the best solution going forward.

> I wonder whether we also have to adapt the "pack-refs" task to be
> synchronous instead of asynchronous?

  reply	other threads:[~2024-11-11 18:06 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-11-08 17:31 [RFC PATCH 0/1] maintenance: separate parallelism safe and unsafe tasks Calvin Wan
2024-11-08 17:31 ` [RFC PATCH 1/1] " Calvin Wan
2024-11-11  7:07   ` Patrick Steinhardt
2024-11-11 18:06     ` Calvin Wan [this message]
2024-11-12  6:28       ` Patrick Steinhardt
2024-11-15 20:13         ` Calvin Wan
2024-11-18  1:32           ` Junio C Hamano
2024-11-18  6:58           ` Patrick Steinhardt
2024-11-11  8:24   ` Junio C Hamano
2024-11-11  9:14   ` Junio C Hamano
2024-11-11 18:12     ` Calvin Wan
2024-11-11  4:50 ` [RFC PATCH 0/1] " Junio C Hamano
2024-11-11 18:39   ` Calvin Wan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAFySSZCzxfqpMWH5ORv8fYb7f5WU3Fc2N99fW33wD9JOcYVrVA@mail.gmail.com \
    --to=calvinwan@google.com \
    --cc=emrass@google.com \
    --cc=git@vger.kernel.org \
    --cc=me@ttaylorr.com \
    --cc=ps@pks.im \
    --cc=steamdon@google.com \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).