git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: Jeff King <peff@peff.net>
Cc: git@vger.kernel.org, Olaf Hering <olaf@aepfle.de>
Subject: Re: [PATCH] ref-filter: disable save_commit_buffer while traversing
Date: Mon, 11 Jul 2022 17:12:37 +0200	[thread overview]
Message-ID: <220711.8635f77j7s.gmgdl@evledraar.gmail.com> (raw)
In-Reply-To: <Ysw4JtoHW1vWmqhz@coredump.intra.peff.net>


On Mon, Jul 11 2022, Jeff King wrote:

> Various ref-filter options like "--contains" or "--merged" may cause us
> to traverse large segments of the history graph. It's counter-productive
> to have save_commit_buffer turned on, as that will instruct the commit
> code to cache in-memory the object contents for each commit we traverse.
>
> This increases the amount of heap memory used while providing little or
> no benefit, since we're not actually planning to display those commits
> (which is the usual reason that tools like git-log want to keep them
> around). We can easily disable this feature while ref-filter is running.
> This lowers peak heap (as measured by massif) for running:
>
>   git tag --contains 1da177e4c3
>
> in linux.git from ~100MB to ~20MB. It also seems to improve runtime by
> 4-5% (600ms vs 630ms).
>
> A few points to note:
>
>   - it should be safe to temporarily disable save_commit_buffer like
>     this. The saved buffers are accessed through get_commit_buffer(),
>     which treats the saved ones like a cache, and loads on-demand from
>     the object database on a cache miss. So any code that was using this
>     would not be wrong, it might just incur an extra object lookup for
>     some objects. But...
>
>   - I don't think any ref-filter related code is using the cache. While
>     it's true that an option like "--format=%(*contents:subject)" or
>     "--sort=*authordate" will need to look at the commit contents,
>     ref-filter doesn't use get_commit_buffer() to do so! It always reads
>     the objects directly via read_object_file(), though it does avoid
>     re-reading objects if the format can be satisfied without them.
>
>     Timing "git tag --format=%(*authordate)" shows that we're the same
>     before and after, as expected.

Hrm, so for doing the format we're leaving some performance on the table
as we're currently not making use of this cache, so this makes nothing
worse on that front.

But doesn't this approach then also close the door on using the same
cache for performance improvements in that area? I.e. spotting that
we've already parsed that commit, so we can get it from the cache?

B.t.w. did you try to benchmark this with --no-contains too, I tried e.g.:

    ./git -P tag --contains 88ce3ef636b --no-contains a39b4003f0e -- "v*"

Which gives me:

	$ git hyperfine -L rev HEAD~1,HEAD -s 'make CFLAGS=-O3' './git -P tag --contains 88ce3ef636b --no-contains a39b4003f0e -- "v*"' -w 1 
	Benchmark 1: ./git -P tag --contains 88ce3ef636b --no-contains a39b4003f0e -- "v*"' in 'HEAD~1
	  Time (mean ± σ):      1.437 s ±  0.107 s    [User: 1.252 s, System: 0.082 s]
	  Range (min … max):    1.306 s …  1.653 s    10 runs
	 
	Benchmark 2: ./git -P tag --contains 88ce3ef636b --no-contains a39b4003f0e -- "v*"' in 'HEAD
	  Time (mean ± σ):      1.335 s ±  0.044 s    [User: 1.230 s, System: 0.050 s]
	  Range (min … max):    1.260 s …  1.417 s    10 runs
	 
	Summary
	  './git -P tag --contains 88ce3ef636b --no-contains a39b4003f0e -- "v*"' in 'HEAD' ran
	    1.08 ± 0.09 times faster than './git -P tag --contains 88ce3ef636b --no-contains a39b4003f0e -- "v*"' in 'HEAD~1'
	
Whereas just --contains shows the benefit you're noting:
	
	$ git hyperfine -L rev HEAD~1,HEAD -s 'make CFLAGS=-O3' './git -P tag --contains 88ce3ef636b -- "v*"' -w 1 
	Benchmark 1: ./git -P tag --contains 88ce3ef636b -- "v*"' in 'HEAD~1
	  Time (mean ± σ):      1.068 s ±  0.102 s    [User: 0.886 s, System: 0.068 s]
	  Range (min … max):    0.889 s …  1.272 s    10 runs
	 
	Benchmark 2: ./git -P tag --contains 88ce3ef636b -- "v*"' in 'HEAD
	  Time (mean ± σ):     931.6 ms ±  39.9 ms    [User: 865.3 ms, System: 34.3 ms]
	  Range (min … max):   880.5 ms … 990.1 ms    10 runs
	 
	Summary
	  './git -P tag --contains 88ce3ef636b -- "v*"' in 'HEAD' ran
	    1.15 ± 0.12 times faster than './git -P tag --contains 88ce3ef636b -- "v*"' in 'HEAD~1'

But this is against git.git on a loaded system, so maybe it means
nothing...

  reply	other threads:[~2022-07-11 15:38 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-07-11 14:48 [PATCH] ref-filter: disable save_commit_buffer while traversing Jeff King
2022-07-11 15:12 ` Ævar Arnfjörð Bjarmason [this message]
2022-07-11 17:47   ` Jeff King
2022-07-11 21:27 ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=220711.8635f77j7s.gmgdl@evledraar.gmail.com \
    --to=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=olaf@aepfle.de \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).