git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jeff King <peff@peff.net>
To: ZheNing Hu <adlternative@gmail.com>
Cc: Taylor Blau <me@ttaylorr.com>, Junio C Hamano <gitster@pobox.com>,
	Git List <git@vger.kernel.org>,
	johncai86@gmail.com
Subject: Re: [Question] Can git cat-file have a type filtering option?
Date: Mon, 10 Apr 2023 16:14:14 -0400	[thread overview]
Message-ID: <20230410201414.GC104097@coredump.intra.peff.net> (raw)
In-Reply-To: <CAOLTT8RbU6G67BtE9fSv4gEn10dtR7cT-jf+dcEfhvNhvcwETQ@mail.gmail.com>

On Sun, Apr 09, 2023 at 02:47:30PM +0800, ZheNing Hu wrote:

> > Perhaps slightly so, since there is naturally going to be some
> > duplicated effort spawning processes, loading any shared libraries,
> > initializing the repository and reading its configuration, etc.
> >
> > But I'd wager that these are all a negligible cost when compared to the
> > time we'll have to spend reading, inflating, and printing out all of the
> > objects in your repository.
> 
> "What you said makes sense. I implemented the --type-filter option for
> git cat-file and compared the performance of outputting all blobs in the
> git repository with and without using the type-filter. I found that the
> difference was not significant.
> 
> time git  cat-file --batch-all-objects --batch-check="%(objectname)
> %(objecttype)" |
> awk '{ if ($2 == "blob") print $1 }' | git cat-file --batch > /dev/null
> 17.10s user 0.27s system 102% cpu 16.987 total
> 
> time git cat-file --batch-all-objects --batch --type-filter=blob >/dev/null
> 16.74s user 0.19s system 95% cpu 17.655 total
> 
> At first, I thought the processes that provide all blob oids by using
> git rev-list or git cat-file --batch-all-objects --batch-check might waste
> cpu, io, memory resources because they need to read a large number
> of objects, and then they are read again by git cat-file --batch.
> However, it seems that this is not actually the bottleneck in performance.

Yeah, I think most of your time there is spent on the --batch command
itself, which is just putting through a lot of bytes. You might also try
with "--unordered". The default ordering for --batch-all-objects is in
sha1 order, which has pretty bad locality characteristics for delta
caching. Using --unordered goes in pack-order, which should be optimal.

E.g., in git.git, running:

  time \
    git cat-file --batch-all-objects --batch-check='%(objecttype) %(objectname)' |
    perl -lne 'print $1 if /^blob (.*)/' |
    git cat-file --batch >/dev/null

takes:

  real	0m29.961s
  user	0m29.128s
  sys	0m1.461s

Adding "--unordered" to the initial cat-file gives:

  real	0m1.970s
  user	0m2.170s
  sys	0m0.126s

So reducing the size of the actual --batch printing may make the
relative cost of using multiple processes much higher (I didn't apply
your --type-filter patches to test myself).

In general, I do think having a processing pipeline like this is OK, as
it's pretty flexible. But especially for smaller queries (even ones that
don't ask for the whole object contents), the per-object lookup costs
can start to dominate (especially in a repository that hasn't been
recently packed). Right now, even your "--batch --type-filter" example
is probably making at least two lookups per object, because we don't
have a way to open a "handle" to an object to check its type, and then
extract the contents conditionally. And of course with multiple
processes, we're naturally doing a separate lookup in each one.

So a nice thing about being able to do the filtering in one process is
that we could _eventually_ do it all with one object lookup. But I'd
probably wait on adding something like --type-filter until we have an
internal single-lookup API, and then we could time it to see how much
speedup we can get.

-Peff

  reply	other threads:[~2023-04-10 20:14 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-04-07 14:24 [Question] Can git cat-file have a type filtering option? ZheNing Hu
2023-04-07 16:30 ` Junio C Hamano
2023-04-08  6:27   ` ZheNing Hu
2023-04-09  1:28     ` Taylor Blau
2023-04-09  2:19       ` Taylor Blau
2023-04-09  2:26         ` Taylor Blau
2023-04-09  6:51           ` ZheNing Hu
2023-04-10 20:01             ` Jeff King
2023-04-10 23:20               ` Taylor Blau
2023-04-09  6:47       ` ZheNing Hu
2023-04-10 20:14         ` Jeff King [this message]
2023-04-11 14:09           ` ZheNing Hu
2023-04-12  7:43             ` Jeff King
2023-04-12  9:57               ` ZheNing Hu
2023-04-14  7:30                 ` Jeff King
2023-04-14 12:17                   ` ZheNing Hu
2023-04-14 15:58                     ` Junio C Hamano
2023-04-16 11:15                       ` ZheNing Hu
2023-04-14 17:04                     ` Linus Torvalds
2023-04-16 12:06                       ` Felipe Contreras
2023-04-16 12:43                       ` ZheNing Hu
2023-04-09  1:26   ` Taylor Blau
2023-04-09  1:23 ` Taylor Blau

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230410201414.GC104097@coredump.intra.peff.net \
    --to=peff@peff.net \
    --cc=adlternative@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=johncai86@gmail.com \
    --cc=me@ttaylorr.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).