From: Han Young <hanyang.tony@bytedance.com>
To: git@vger.kernel.org
Cc: Han Young <hanyoung@protonmail.com>
Subject: [RFC PATCH 0/4] add parallel unlink
Date: Sun, 3 Dec 2023 21:39:07 +0800
Message-ID: <20231203133911.41594-1-hanyoung@protonmail.com>

We have had the parallel_checkout option since 04155bdad, but unlinking
is still performed single-threaded. In a very large repository, a
directory rename or reorganization can lead to a large number of
unlinked entries. In some instances, the unlink process can be slower
than the parallel checkout.

This series of patches introduces basic support for parallel unlink.
The removal of individual files can be easily multithreaded, but
removing empty directories is a little tricky. If one thread decides to
remove a directory, that directory may still contain files that need to
be deleted by another thread. I had to use a mutex-guarded hashset to
collect these 'race' directories and remove them after all threads have
been joined. Maybe there are ways to do this without a mutex and a
hashset?

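To make the directory race concrete, here is a minimal standalone
sketch of the idea. It is not the code from this series. All names
(deferred_dirs, defer_dir, unlink_worker, remove_deferred_dirs,
demo_parallel_unlink) are hypothetical, and a small fixed-size array
stands in for the hashset. Each worker unlinks its slice of files and
then tries rmdir(); if the directory is still non-empty, it records the
directory under a mutex, and a single-threaded pass removes the
recorded directories after all workers have been joined.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

#define MAX_DEFERRED 64
#define MAX_PATH_LEN 256
#define NR_WORKERS 4
#define NR_FILES 8

/* Hypothetical stand-in for the mutex-guarded hashset: a small array. */
struct deferred_dirs {
	pthread_mutex_t lock;
	char paths[MAX_DEFERRED][MAX_PATH_LEN];
	int nr;
};

/* Record dir once. Sketch only: a real set would grow instead of asserting. */
static void defer_dir(struct deferred_dirs *set, const char *dir)
{
	int i;

	pthread_mutex_lock(&set->lock);
	for (i = 0; i < set->nr; i++)
		if (!strcmp(set->paths[i], dir))
			break;
	if (i == set->nr) {
		assert(set->nr < MAX_DEFERRED && strlen(dir) < MAX_PATH_LEN);
		strcpy(set->paths[set->nr++], dir);
	}
	pthread_mutex_unlock(&set->lock);
}

struct unlink_task {
	struct deferred_dirs *set;
	const char *dir;	/* parent directory shared by all workers */
	const char **files;	/* this worker's slice of paths */
	int nr;
};

static void *unlink_worker(void *arg)
{
	struct unlink_task *task = arg;
	int i;

	for (i = 0; i < task->nr; i++)
		unlink(task->files[i]);
	/* Another worker may still own files here, so rmdir() can fail. */
	if (rmdir(task->dir) && (errno == ENOTEMPTY || errno == EEXIST))
		defer_dir(task->set, task->dir);
	return NULL;
}

/* Run after every worker is joined; returns how many dirs could not be removed. */
static int remove_deferred_dirs(struct deferred_dirs *set)
{
	int i, failed = 0;

	for (i = 0; i < set->nr; i++)
		if (rmdir(set->paths[i]) && errno != ENOENT)
			failed++;
	return failed;
}

/* Demo: fill root with NR_FILES files, delete them in parallel; 0 if root is gone. */
static int demo_parallel_unlink(const char *root)
{
	struct deferred_dirs set;
	char paths[NR_FILES][MAX_PATH_LEN];
	const char *slices[NR_FILES];
	struct unlink_task tasks[NR_WORKERS];
	pthread_t threads[NR_WORKERS];
	int i, started, per = NR_FILES / NR_WORKERS, ret = 0;

	pthread_mutex_init(&set.lock, NULL);
	set.nr = 0;
	if (mkdir(root, 0700))
		return -1;
	for (i = 0; i < NR_FILES; i++) {
		FILE *f;

		snprintf(paths[i], sizeof(paths[i]), "%s/f%d", root, i);
		f = fopen(paths[i], "w");
		if (!f)
			return -1;
		fclose(f);
		slices[i] = paths[i];
	}
	for (started = 0; started < NR_WORKERS; started++) {
		tasks[started] = (struct unlink_task){ &set, root, slices + started * per, per };
		if (pthread_create(&threads[started], NULL, unlink_worker, &tasks[started])) {
			ret = -1;
			break;
		}
	}
	for (i = 0; i < started; i++)
		pthread_join(threads[i], NULL);
	if (remove_deferred_dirs(&set))
		ret = -1;
	pthread_mutex_destroy(&set.lock);
	return (!ret && access(root, F_OK)) ? 0 : -1;
}
```

Calling demo_parallel_unlink() with a path that does not exist yet
should return 0 once the directory and its files have been created and
removed again. The deferred pass tolerates ENOENT because another
worker may already have removed the directory by the time it runs.
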
The speed of unlinking files seems to vary from system to system. I did
some tests with a private repo. When I check out a commit with 15000
moved files on a Linux machine with btrfs, parallel_unlink yields a 10%
speedup. But on an Intel MacBook Pro with APFS, the speedup is over
100%. I find it difficult to choose the default threshold for
parallel_unlink.

This series is by no means complete. Many functions contain duplicated
code, and there are some memory leaks. I would like to know the
community's opinion before proceeding: is this worth doing, or is it a
waste of time?

Han Young (4):
  symlinks: add and export threaded rmdir variants
  entry: add threaded_unlink_entry function
  parallel-checkout: add parallel_unlink
  unpack-trees: introduce parallel_unlink

 entry.c             |  16 ++++++
 entry.h             |   3 ++
 parallel-checkout.c |  80 +++++++++++++++++++++++++++++
 parallel-checkout.h |  25 +++++++++
 symlinks.c          | 120 ++++++++++++++++++++++++++++++++++++++++++--
 symlinks.h          |   6 +++
 unpack-trees.c      |  15 +-----
 7 files changed, 249 insertions(+), 16 deletions(-)

--
2.43.0