From: Jeff King <peff@peff.net>
To: ZheNing Hu <adlternative@gmail.com>
Cc: Taylor Blau <me@ttaylorr.com>, Junio C Hamano <gitster@pobox.com>,
Git List <git@vger.kernel.org>,
johncai86@gmail.com
Subject: Re: [Question] Can git cat-file have a type filtering option?
Date: Mon, 10 Apr 2023 16:14:14 -0400
Message-ID: <20230410201414.GC104097@coredump.intra.peff.net>
In-Reply-To: <CAOLTT8RbU6G67BtE9fSv4gEn10dtR7cT-jf+dcEfhvNhvcwETQ@mail.gmail.com>
On Sun, Apr 09, 2023 at 02:47:30PM +0800, ZheNing Hu wrote:
> > Perhaps slightly so, since there is naturally going to be some
> > duplicated effort spawning processes, loading any shared libraries,
> > initializing the repository and reading its configuration, etc.
> >
> > But I'd wager that these are all a negligible cost when compared to the
> > time we'll have to spend reading, inflating, and printing out all of the
> > objects in your repository.
>
> What you said makes sense. I implemented the --type-filter option for
> git cat-file and compared the performance of outputting all blobs in the
> git repository with and without using the type-filter. I found that the
> difference was not significant.
>
> time git cat-file --batch-all-objects --batch-check="%(objectname) %(objecttype)" |
> awk '{ if ($2 == "blob") print $1 }' | git cat-file --batch > /dev/null
> 17.10s user 0.27s system 102% cpu 16.987 total
>
> time git cat-file --batch-all-objects --batch --type-filter=blob >/dev/null
> 16.74s user 0.19s system 95% cpu 17.655 total
>
> At first, I thought the processes that provide all blob oids by using
> git rev-list or git cat-file --batch-all-objects --batch-check might waste
> cpu, io, memory resources because they need to read a large number
> of objects, and then they are read again by git cat-file --batch.
> However, it seems that this is not actually the bottleneck in performance.

Yeah, I think most of your time there is spent on the --batch command
itself, which is just putting through a lot of bytes. You might also try
with "--unordered". The default ordering for --batch-all-objects is in
sha1 order, which has pretty bad locality characteristics for delta
caching. Using --unordered goes in pack-order, which should be optimal.

E.g., in git.git, running:

  time \
    git cat-file --batch-all-objects --batch-check='%(objecttype) %(objectname)' |
    perl -lne 'print $1 if /^blob (.*)/' |
    git cat-file --batch >/dev/null

takes:

  real 0m29.961s
  user 0m29.128s
  sys 0m1.461s

Adding "--unordered" to the initial cat-file gives:

  real 0m1.970s
  user 0m2.170s
  sys 0m0.126s
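Spelled out in full, the "--unordered" variant might look roughly like the sketch below. The function names blob_oids and dump_all_blobs are made up for illustration and are not part of git, and awk stands in for the perl one-liner above; treat it as a sketch rather than the exact command that was timed.

```shell
# Hypothetical helper (not part of git): read "<type> <oid>" lines on
# stdin and print only the oids of blobs.
blob_oids() {
	awk '$1 == "blob" { print $2 }'
}

# Hypothetical wrapper: dump every blob in the current repository,
# enumerating objects in pack order rather than sha1 order.
dump_all_blobs() {
	git cat-file --batch-all-objects --unordered \
		--batch-check='%(objecttype) %(objectname)' |
	blob_oids |
	git cat-file --batch
}
```

Running "time dump_all_blobs >/dev/null" inside a repository, with and without the --unordered flag, should reproduce the kind of comparison shown above; the actual numbers will depend on the repository and how it is packed.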

So reducing the size of the actual --batch printing may make the
relative cost of using multiple processes much higher (I didn't apply
your --type-filter patches to test myself).

In general, I do think having a processing pipeline like this is OK, as
it's pretty flexible. But especially for smaller queries (even ones that
don't ask for the whole object contents), the per-object lookup costs
can start to dominate (especially in a repository that hasn't been
recently packed). Right now, even your "--batch --type-filter" example
is probably making at least two lookups per object, because we don't
have a way to open a "handle" to an object to check its type, and then
extract the contents conditionally. And of course with multiple
processes, we're naturally doing a separate lookup in each one.
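For a smaller query that never needs the object contents, the second cat-file process can sometimes be dropped entirely, since --batch-check can report type, name and size in one pass. A rough sketch, assuming all that is wanted is the size of each blob (blob_sizes and list_blob_sizes are hypothetical names, not git commands):

```shell
# Hypothetical helper (not part of git): read "<type> <oid> <size>"
# lines on stdin and print "<oid> <size>" for blobs only.
blob_sizes() {
	awk '$1 == "blob" { print $2, $3 }'
}

# Hypothetical wrapper: list the size of every blob in the current
# repository using a single cat-file process.
list_blob_sizes() {
	git cat-file --batch-all-objects --unordered \
		--batch-check='%(objecttype) %(objectname) %(objectsize)' |
	blob_sizes
}
```

This only avoids the extra process and the lookups it would repeat; it does not change how many lookups the one remaining cat-file performs internally.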

So a nice thing about being able to do the filtering in one process is
that we could _eventually_ do it all with one object lookup. But I'd
probably wait on adding something like --type-filter until we have an
internal single-lookup API, and then we could time it to see how much
speedup we can get.

-Peff