Re: Fast enumeration of objects

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Jeff King <peff@peff.net>
To: Charles Bailey <charles@hashpling.org>
Cc: Junio C Hamano <gitster@pobox.com>, git@vger.kernel.org
Subject: Re: Fast enumeration of objects
Date: Mon, 22 Jun 2015 04:35:44 -0400	[thread overview]
Message-ID: <20150622083543.GA12259@peff.net> (raw)
In-Reply-To: <1434914431-7745-1-git-send-email-charles@hashpling.org>

On Sun, Jun 21, 2015 at 08:20:30PM +0100, Charles Bailey wrote:

> I performed some test timings of some different commands on a clone of
> the Linux kernel which was completely packed.

Thanks for timing things. I think we can fairly easily improve a bit on
what you have here. I'll go through my full analysis, but see the
conclusions at the end.

> 	$ time git rev-list --all --objects |
> 		cut -d" " -f1 |
> 		git cat-file --batch-check |
> 		awk '{if ($3 >= 512000) { print $1 }}' |
> 		wc -l
> 	958
> 
> 	real    0m30.823s
> 	user    0m41.904s
> 	sys     0m7.728s
> 
> list-all-objects gives a significant improvement:
> 
> 	$ time git list-all-objects |
> 		git cat-file --batch-check |
> 		awk '{if ($3 >= 512000) { print $1 }}' |
> 		wc -l
> 	958
> 
> 	real    0m9.585s
> 	user    0m10.820s
> 	sys     0m4.960s

That makes sense; of course these two are not necessarily producing the
same answer (they do in your case because it's a fresh clone, and all of
the objects are reachable). I think that's an acceptable caveat.

You can speed up the second one by asking batch-check only for the parts
you care about:

  git list-all-objects |
  git cat-file --batch-check='%(objectsize) %(objectname)' |
  awk '{if ($1 >= 512000) { print $2 }}' |
  wc -l

That dropped my best-of-five timings for the same test down from 9.5s to
7.0s. The answer should be the same. The reason is that cat-file will
only compute the items it needs to show, and the object-type is more
expensive to get than the size[1].

Replacing awk with:

  perl -alne 'print $F[0] if $F[1] > 512000'

dropped that to 6.0s. That mostly means my awk sucks, but it is
interesting to note that not all of the extra time is pipe overhead
inherent to this approach; your choice of processor matters, too.

If you're willing to get a slightly different answer, but one that is
often just as useful, you can replace the "%(objectsize)" in the
cat-file invocation with "%(objectsize:disk)". That gives you the actual
on-disk size of the object, which includes delta and zlib compression.
For 512K, that produces very different results (because files of that
size may actually be text file). But for most truly huge files, they
typically do not delta or compress at all, and the on-disk size is
roughly the same.

That only shaves off 100-200 milliseconds, though.

[1] If you are wondering why the size is cheaper than the type, it is
    because of deltas. For base objects, we can get either immediately
    from the pack entry's header. For a delta, to get the size we have
    to open the object data; the expected size is part of the delta
    data. So we pay the extra cost to zlib-inflate the first few bytes.
    But finding the type works differently; the type in the pack header
    is OFS_DELTA, so we have to walk back to the parent entry to find
    the real type.  If that parent is a delta, we walk back recursively
    until we hit a base object.

    You'd think that would also make %(objectsize:disk) much cheaper
    than %(objectsize), too. But the disk sizes require computing a
    the pack revindex on the fly, which takes a few hundred milliseconds
    on linux.git.

> skipping the cat-filter filter is a lesser but still significant
> improvement:
> 
> 	$ time git list-all-objects -v |
> 		awk '{if ($3 >= 512000) { print $1 }}' |
> 		wc -l
> 	958
> 
> 	real    0m5.637s
> 	user    0m6.652s
> 	sys     0m0.156s

That's a pretty nice improvement over the piped version. But we cannot
do the same custom-format optimization there, because "-v" does not
support it. It would be nice if it supported the full range of cat-file
formatters.

I did a hacky proof-of-concept, and that brought my 6.0s time down to
4.9s.

I also noticed that cat-file doesn't do any output buffering; this is
because it may be used interactively, line by line, by a caller
controlling both pipes. Replacing write_or_die() with fwrite in my
proof-of-concept dropped the time to 3.7s.

That's faster still than your original (different machines, obviously,
but your times are similar to mine):

> The old filter-objects could do the size filter a little be faster, but
> not by much:
> 
> 	$ time git filter-objects --min-size=500k |
> 		wc -l
> 	958
> 
> 	real    0m4.564s
> 	user    0m4.496s
> 	sys     0m0.064s

This is likely caused by your use of sha1_object_info(), which always
computes the type. Switching to the extended form would probably buy you
another 2 seconds or so.

Also, all my numbers are wall-clock times. The CPU time for my 3.7s time
is actually 6.8s. Whereas doing it all in one process would probably
require 3.0s or so of actual CPU time.

So my conclusions are:

  1. Yes, the pipe/parsing overhead of a separate processor really is
     measurable. That's hidden in the wall-clock time if you have
     multiple cores, but you may care more about CPU time. I still think
     the flexibility is worth it.

  2. Cutting out the pipe to cat-file is worth doing, as it saves a few
     seconds. Cutting out "%(objecttype)" saves a lot, too, and is worth
     doing. We should teach "list-all-objects -v" to use cat-file's
     custom formatters (alternatively, we could just teach cat-file a
     "--batch-all-objects" option rather than add a new command).

  3. We should teach cat-file a "--buffer" option to use fwrite. Even if
     we end up with "list-all-objects --format='%(objectsize)'" for this
     task, it would help all the other uses of cat-file.

-Peff

next prev parent reply	other threads:[~2015-06-22  8:35 UTC|newest]

Thread overview: 51+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-06-19  9:10 Improvements to parse-options and a new filter-objects command Charles Bailey
2015-06-19  9:10 ` [PATCH 1/3] Correct test-parse-options to handle negative ints Charles Bailey
2015-06-19 18:28   ` Junio C Hamano
2015-06-19  9:10 ` [PATCH 2/3] Move unsigned long option parsing out of pack-objects.c Charles Bailey
2015-06-19 11:03   ` Remi Galan Alfonso
2015-06-19 11:06     ` Charles Bailey
2015-06-19 17:58   ` Junio C Hamano
2015-06-19 18:39     ` Junio C Hamano
2015-06-20 15:31       ` Jakub Narębski
2015-06-19 18:47     ` Jakub Narębski
2015-06-20 16:51     ` Charles Bailey
2015-06-20 17:47       ` Junio C Hamano
2015-06-19  9:10 ` [PATCH 3/3] Add filter-objects command Charles Bailey
2015-06-19 10:10   ` Jeff King
2015-06-19 10:33     ` Charles Bailey
2015-06-19 10:52       ` Jeff King
2015-06-19 18:28         ` Junio C Hamano
2015-06-19 10:52       ` John Keeping
2015-06-19 11:04         ` Charles Bailey
2015-06-21 18:25 ` Improvements to integer option parsing Charles Bailey
2015-06-21 18:25   ` [PATCH 1/2] Correct test-parse-options to handle negative ints Charles Bailey
2015-06-21 18:25   ` [PATCH 2/2] Move unsigned long option parsing out of pack-objects.c Charles Bailey
2015-06-21 18:30     ` Charles Bailey
2015-06-22 22:03       ` Junio C Hamano
2015-06-22 22:08     ` Junio C Hamano
2015-06-22 22:09   ` Improvements to integer option parsing Junio C Hamano
2015-06-22 22:42     ` Charles Bailey
2015-06-21 19:20 ` Fast enumeration of objects Charles Bailey
2015-06-21 19:20   ` [PATCH] Add list-all-objects command Charles Bailey
2015-06-22  8:38     ` Jeff King
2015-06-22 10:33       ` Jeff King
2015-06-22 10:40         ` [PATCH 1/7] for_each_packed_object: automatically open pack index Jeff King
2015-06-22 10:40         ` [PATCH 2/7] cat-file: minor style fix in options list Jeff King
2015-06-22 10:41         ` [PATCH 3/7] cat-file: move batch_options definition to top of file Jeff King
2015-06-22 10:45         ` [PATCH 4/7] cat-file: add --buffer option Jeff King
2015-06-22 10:45         ` [PATCH 5/7] cat-file: stop returning value from batch_one_object Jeff King
2015-06-22 10:45         ` [PATCH 6/7] cat-file: split batch_one_object into two stages Jeff King
2015-06-22 10:45         ` [PATCH 7/7] cat-file: add --batch-all-objects option Jeff King
2015-06-26  6:56           ` Eric Sunshine
2015-06-26 15:48             ` Jeff King
2015-06-22 11:06         ` [PATCH 8/7] cat-file: sort and de-dup output of --batch-all-objects Jeff King
2015-06-22 22:03           ` Charles Bailey
2015-06-22 23:46             ` Jeff King
2015-06-22 21:48         ` [PATCH] Add list-all-objects command Charles Bailey
2015-06-22 21:50         ` Junio C Hamano
2015-06-22 23:50           ` Jeff King
2015-06-22 11:38       ` Charles Bailey
2015-06-22  9:57     ` Duy Nguyen
2015-06-22 10:24       ` Jeff King
2015-06-22  8:35   ` Jeff King [this message]
2015-06-22 19:44     ` Fast enumeration of objects Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150622083543.GA12259@peff.net \
    --to=peff@peff.net \
    --cc=charles@hashpling.org \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).