Re: Parallelism for submodule update

From: Calvin Wan <calvinwan@google.com>
To: Christian.Zitzmann@vitesco.com
Cc: Calvin Wan <calvinwan@google.com>,
	"git@vger.kernel.org" <git@vger.kernel.org>
Subject: Re: Parallelism for submodule update
Date: Thu, 19 Jan 2023 21:39:11 +0000
Message-ID: <20230119213911.1515188-1-calvinwan@google.com>
In-Reply-To: <DB5PR02MB100691E6422F5E94228F0E0EC8AF79@DB5PR02MB10069.eurprd02.prod.outlook.com>

Hi Christian,

I investigated this as well about 2 months ago and am happy to share my
findings with you :)

> When updating the submodules, only the fetching part is done in parallel (with config submodule.fetchjobs or --jobs) but the checkout is done sequentially

Correct.

> What I’ve recognized when cloning with
> - scalar clone --full-clone --recurse-submodules <URL>
> or
> - git clone --filter=blob:none --also-filter-submodules --recurse-submodules <URL>
> 
> We lose performance, as the fetch of the blobs is done in the sequential checkout part, instead of in the parallel part.
> 
> Furthermore, the utilization - without partial clone - of network and hard disk is not always good, as first the network is utilized (fetch) and then the hard disk (checkout)

Also an astute observation: parallelizing fetch and checkout as separate
phases doesn't allow us to fully use our resources.

> As the checkout part is local to the submodule (no shared resources to block), it would be great if we could move the checkout into the parallelized part.
> E.g. by doing fetch and checkout (with blob fetching) in one step with e.g. run_processes_parallel_tr2
> 
> I expect that this significantly improves the performance, especially when using partial clones.
> 
> Do you think this is possible? Am I missing anything?

Sort of. The issue with run_processes_parallel_tr2 is that it creates a
subprocess with a git command. There is no git command that we can call
that lets us do both the correct fetch and checkout command, so first
you would have to create a new option/command for that (and what happens
if we want to add to that parallelization in the future? Create another
option/command?). I think we can do better than that!

`git submodule update`, when called from clone, essentially does 4
things to the submodule: init, clone, checkout, and recursively calls
itself for child submodules. One idea I had was to separate out the
individual tasks that `git submodule update` does and create a new
submodule--helper command (e.g. git submodule--helper update-helper) that
calls those individual tasks. Then, clone would directly call
run_processes_parallel_tr2 with the new submodule--helper command and
each process separated by submodule.

This is the general idea of what I imagine
`git clone --recurse-submodules` would look like:
superproject cloning
run_processes_parallel_tr2(git submodule--helper update-helper)
        Init
        Clone
        Checkout
        Recursive git submodule--helper update-helper
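
To make the outline above concrete, here is a small Python model of it.
This is only a sketch under assumptions, not git code: threads stand in
for the subprocesses that run_processes_parallel_tr2 would spawn, the
TREE layout and all function names are hypothetical, and each step just
records that it ran instead of doing real work.

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

# Hypothetical submodule tree: path -> list of child submodule paths.
TREE = {
    "libA": ["libA/vendor"],
    "libA/vendor": [],
    "libB": [],
}

log = []
log_lock = Lock()


def record(path, step):
    """Append (path, step) to the shared log under a lock."""
    with log_lock:
        log.append((path, step))


def update_helper(path):
    """Stand-in for the proposed `git submodule--helper update-helper`."""
    for step in ("init", "clone", "checkout"):
        record(path, step)  # a real helper would do the work here
    for child in TREE[path]:
        update_helper(child)  # recursive step, run inside the same worker


def clone_recurse_submodules(top_level, jobs):
    """Run one update_helper per top-level submodule, at most `jobs` at once."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        list(pool.map(update_helper, top_level))


clone_recurse_submodules(["libA", "libB"], jobs=2)
```

In this model the recursion into child submodules runs serially inside
each worker; whether the real helper would parallelize that level as
well is left open here.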

I'll discuss what I think are the benefits of this approach:
- The entirety of submodule update would be parallelized so network and
  hard disk resources can be used together
- There only needs to be one config option that controls how many
  parallel processes to spawn
- Any new features to submodule update are automatically parallelized

The drawback is that any new feature that would cause a race condition
if run in parallel would have to have additional locking code written
for it since separating it out would be difficult. In this case, only
adding lines to .gitmodules in init is at risk of a race condition, but
fortunately that can be handled first in series before running
everything else in parallel.
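
The "series first, then parallel" split can be approximated today from
outside git. The following Python sketch is an assumption-laden
illustration, not the proposed helper: update_submodules, run_git, and
the `run` parameter are hypothetical names, threads drive ordinary
`git submodule` commands, and I have not verified that concurrent
`git submodule update` calls in one superproject are free of lock
contention.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_git(repo, *args):
    """Run one git command inside `repo`; raises if git exits non-zero."""
    subprocess.run(["git", "-C", repo, *args], check=True)


def update_submodules(repo, paths, jobs=4, run=run_git):
    """Init each submodule serially, then update them in parallel."""
    # Phase 1 (serial): init is the step identified above as racy,
    # so it runs for every path before any worker starts.
    for path in paths:
        run(repo, "submodule", "init", "--", path)

    # Phase 2 (parallel): clone, checkout, and recursion per submodule.
    # --init is passed so nested submodules get initialized; for the
    # top-level path it should have nothing left to write (assumption).
    def update(path):
        run(repo, "submodule", "update", "--init", "--recursive", "--", path)

    with ThreadPoolExecutor(max_workers=jobs) as pool:
        list(pool.map(update, paths))
```

The `run` parameter exists only so the ordering can be checked with a
fake runner, without touching a real repository.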

I haven't started implementing this and am not planning to fix this in
the near future. This is because we are planning a more long-term
solution (2y+) to solve problems like this (notice how much simpler it
would've been to add parallelization if we didn't have to create
subprocesses for every separate git command and instead could call from
a variety of library functions). So if you need the parallelization
sooner or want to scratch your itch, you're more than welcome to
implement it. Happy to bounce ideas around and review any patches for
this!

Thanks,
Calvin

Thread overview: 4+ messages
2023-01-02 16:44 Parallelism for submodule update Zitzmann, Christian
2023-01-02 16:54 ` rsbecker
2023-01-13 10:49   ` Zitzmann, Christian
2023-01-19 21:39 ` Calvin Wan [this message]
