Unpredictable peak memory usage when using `git log` command

git.vger.kernel.org archive mirror

* Unpredictable peak memory usage when using `git log` command
@ 2024-08-30 12:20 Yuri Karnilaev
  2024-08-30 20:53 ` [PATCH] revision: free commit buffers for skipped commits Jeff King
  2024-08-30 21:06 ` Unpredictable peak memory usage when using `git log` command Jeff King
  0 siblings, 2 replies; 6+ messages in thread
From: Yuri Karnilaev @ 2024-08-30 12:20 UTC
  To: git

Hello,

I encountered an issue when using the `git log` command to retrieve commits in large repositories. My task is to iterate over all commits and output them in a specific format. However, my computer has limited memory, so I am looking for a way to reduce the memory consumption of this operation.

I tested two different commands on the `torvalds/linux` repository as an example of a large repository and noticed a significant difference in peak memory usage:

1. Processing all commits in one go:
```
/usr/bin/time -l -h -p git log --ignore-missing --pretty=format:%H%x02%P%x02%aN%x02%aE%x02%at%x00 --numstat > 1.txt
```
Result:
```
real 594,01
user 562,22
sys 12,43
          7407976448  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
              187437  page reclaims
              274228  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                1031  voluntary context switches
              287056  involuntary context switches
       5455479398547  instructions retired
       1828253079874  cycles elapsed
           135616064  peak memory footprint
```

2. Processing commits in batches:
```
/usr/bin/time -l -h -p git log --ignore-missing --pretty=format:%H%x02%P%x02%aN%x02%aE%x02%at%x00 -n 1000 --skip=1000000 --numstat > 1.txt
```
Result:
```
real 9,83
user 7,48
sys 0,40
          2390540288  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
               93487  page reclaims
               52995  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                 634  voluntary context switches
               14183  involuntary context switches
         50173495540  instructions retired
         24906960156  cycles elapsed
          1470935680  peak memory footprint
```

As you can see from the results, the peak memory usage when processing commits in batches is 10 times higher than when processing all commits in one go.
Can you please explain why this happens? Is there a way to work around this? Or maybe can you fix this in future Git versions?

Operating System: Mac OS 14.6.1 (23G93)
Git Version: 2.39.3 (Apple Git-146)

Best regards,
Yuri

* [PATCH] revision: free commit buffers for skipped commits
  2024-08-30 12:20 Unpredictable peak memory usage when using `git log` command Yuri Karnilaev
@ 2024-08-30 20:53 ` Jeff King
  2024-08-30 21:27   ` Junio C Hamano
  2024-08-30 21:06 ` Unpredictable peak memory usage when using `git log` command Jeff King
  1 sibling, 1 reply; 6+ messages in thread
From: Jeff King @ 2024-08-30 20:53 UTC
  To: Yuri Karnilaev; +Cc: git

On Fri, Aug 30, 2024 at 03:20:15PM +0300, Yuri Karnilaev wrote:

> As you can see from the results, the peak memory usage when processing
> commits in batches is 10 times higher than when processing all commits
> in one go.
> Can you please explain why this happens? Is there a way to work around
> this? Or maybe can you fix this in future Git versions?

Try this:

-- >8 --
Subject: [PATCH] revision: free commit buffers for skipped commits

In git-log we leave the save_commit_buffer flag set to "1", which tells
the commit parsing code to store the object content after it has parsed
it to find parents, tree, etc. That lets us reuse the contents for
pretty-printing the commit in the output. And then after printing each
commit, we call free_commit_buffer(), since we don't need it anymore.

But some options may cause us to traverse commits which are not part of
the output. And so git-log does not see them at all, and doesn't free
them. One such case is something like:

  git log -n 1000 --skip=1000000

which will churn through a million commits, before showing only a
thousand. We loop through these inside get_revision(), without freeing
the contents. As a result, we end up storing the object data for those
million commits simultaneously.

We should free the stored buffers (if any) for those commits as we skip
over them, which is what this patch does. Running the above command in
linux.git drops the peak heap usage from ~1.1GB to ~200MB, according to
valgrind/massif. (I thought we might get an even bigger improvement, but
the remaining memory is going to commit/tree structs, which we do hold
on to forever).

Note that this problem doesn't occur if:

  - you're running git-rev-list without a --format parameter; it turns
    off save_commit_buffer by default, since it only outputs the object
    id

  - you've built a commit-graph file, since in that case we'd use the
    optimized graph data instead of the initial parse, and then do a
    lazy parse for commits we're actually going to output

There are probably some other option combinations that can likewise
end up with useless stored commit buffers. For example, if you ask for
"foo..bar", then we'll have to walk down to the merge base, and
everything on the "foo" side won't be shown. Tuning the "save" behavior
to handle that might be tricky (I guess maybe drop buffers for anything
we mark as UNINTERESTING?). And in the long run, the right solution here
is probably to make sure the commit-graph is built (since it fixes the
memory problem _and_ drastically reduces CPU usage).

But since this "--skip" case is an easy one-liner, it's worth fixing in
the meantime. It should be OK to make this call even if there is no
saved buffer (e.g., because save_commit_buffer=0, or because a
commit-graph was used), since it's O(1) to look up the buffer and is a
noop if it isn't present. I verified by running the above command after
"git commit-graph write --reachable", and it takes the same time with
and without this patch.

Reported-by: Yuri Karnilaev <karnilaev@gmail.com>
Signed-off-by: Jeff King <peff@peff.net>
---
 revision.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/revision.c b/revision.c
index ac94f8d429..2d7ad2bddf 100644
--- a/revision.c
+++ b/revision.c
@@ -4407,6 +4407,7 @@ static struct commit *get_revision_internal(struct rev_info *revs)
 				c = get_revision_1(revs);
 				if (!c)
 					break;
+				free_commit_buffer(revs->repo->parsed_objects, c);
 			}
 		}

-- 
2.46.0.769.g1b22d789e3

* Re: [PATCH] revision: free commit buffers for skipped commits
  2024-08-30 20:53 ` [PATCH] revision: free commit buffers for skipped commits Jeff King
@ 2024-08-30 21:27   ` Junio C Hamano
  0 siblings, 0 replies; 6+ messages in thread
From: Junio C Hamano @ 2024-08-30 21:27 UTC
  To: Jeff King; +Cc: Yuri Karnilaev, git

Jeff King <peff@peff.net> writes:

> But since this "--skip" case is an easy one-liner, it's worth fixing in
> the meantime.

OK.

> diff --git a/revision.c b/revision.c
> index ac94f8d429..2d7ad2bddf 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -4407,6 +4407,7 @@ static struct commit *get_revision_internal(struct rev_info *revs)
>  				c = get_revision_1(revs);
>  				if (!c)
>  					break;
> +				free_commit_buffer(revs->repo->parsed_objects, c);
>  			}

Even if we freed the buffer and then later need it, we'd read the
buffer again anyway, so this is a safe thing to do.  And because
commits skipped in this separate loop will _never_ be given to the
caller of get_revision(), this is a reasonable optimization, too.

Will queue.  Thanks.

* Re: Unpredictable peak memory usage when using `git log` command
  2024-08-30 12:20 Unpredictable peak memory usage when using `git log` command Yuri Karnilaev
  2024-08-30 20:53 ` [PATCH] revision: free commit buffers for skipped commits Jeff King
@ 2024-08-30 21:06 ` Jeff King
  2024-08-31 10:24   ` Yuri Karnilaev
  2024-09-02 13:08   ` Patrick Steinhardt
  1 sibling, 2 replies; 6+ messages in thread
From: Jeff King @ 2024-08-30 21:06 UTC
  To: Yuri Karnilaev; +Cc: git

On Fri, Aug 30, 2024 at 03:20:15PM +0300, Yuri Karnilaev wrote:

> 2. Processing commits in batches:
> ```
> /usr/bin/time -l -h -p git log --ignore-missing --pretty=format:%H%x02%P%x02%aN%x02%aE%x02%at%x00 -n 1000 --skip=1000000 --numstat > 1.txt
> ```
> [...]
> Operating System: Mac OS 14.6.1 (23G93)
> Git Version: 2.39.3 (Apple Git-146)

I sent a patch which I think should make things better for you, but I
wanted to mention two things in a more general way:

  1. You should really consider building a commit-graph file with "git
     commit-graph write --reachable". That will reduce the memory usage
     for this case, but also improve the CPU quite a bit (we won't have
     to open those million skipped commits to chase their parent
     pointers).

     I haven't kept up with the defaults for writing graph files. I
     thought gc.writeCommitGraph defaults to "true" these days, though
     that wouldn't help in a freshly cloned repository (arguably we
     should write the commit graph on clone?).

  2. Using "--skip" still has to traverse all of those intermediate
     commits. So it's effectively quadratic in the number of commits
     overall (you end up skipping the first 1000 over and over).

     It's been a while since I've had to "paginate" segments of history
     like this, but a better solution is along the lines of:

       - use "-n 1000" to get 1000 commits in each chunk

       - use "--boundary" to report the commits that were queued to be
	 traversed next but weren't shown

       - in invocations after the first one, start the traversal at
	 those boundary commits, rather than HEAD

     You'll probably need to add "%m" to your format to show the
     boundaries (or alternatively, you can do the commit selection with
     rev-list, and then output the result to "log --no-walk --stdin" to
     do the pretty-printing).
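
The chunked traversal described above might be scripted roughly as follows. This is a minimal sketch, not a tested tool: the function names `next_tips` and `walk_history` and the `PAGE_SIZE` variable are made up for illustration, and the output format is reduced to `%m%H`. Carrying over starting points that were not reached within a page is an assumption of this sketch rather than something stated in the message; it rests on the understanding that only parents of shown commits are reported as boundaries.

```sh
# Sketch of paging through history without --skip.
# All names here (PAGE_SIZE, next_tips, walk_history) are hypothetical.

PAGE_SIZE=1000

# Reads one page of "git log --boundary --format=%m%H" output on stdin.
# Arguments: the full commit IDs this page was started from.
# Prints the starting points for the next page, one per line.
next_tips() {
    page=$(cat)
    # "-" marks boundary commits: queued for traversal but not shown.
    printf '%s\n' "$page" | sed -n 's/^-//p'
    # Assumption: a starting point that was not shown in this page is
    # not necessarily reported as a boundary, so carry it over by hand.
    for tip in "$@"; do
        printf '%s\n' "$page" | grep -qxF ">$tip" || printf '%s\n' "$tip"
    done
}

# Prints every commit ID reachable from HEAD, PAGE_SIZE at a time.
walk_history() {
    tips=$(git rev-parse HEAD)
    while [ -n "$tips" ]; do
        # $tips is intentionally unquoted: one argument per commit ID.
        page=$(git log -n "$PAGE_SIZE" --boundary --format='%m%H' $tips --)
        # ">" marks the commits that belong to this page.
        printf '%s\n' "$page" | sed -n 's/^>//p'
        tips=$(printf '%s\n' "$page" | next_tips $tips | sort -u)
    done
}
```

Running `walk_history` inside a repository walks each commit roughly once instead of re-skipping earlier pages. If commit dates are out of order, a commit may show up in more than one page, so a consumer that needs each commit exactly once may want to de-duplicate. For the rev-list variant, the selection would be done with `git rev-list -n 1000 --boundary` and the resulting IDs piped into `git log --no-walk=unsorted --stdin` with the full format string.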

-Peff

* Re: Unpredictable peak memory usage when using `git log` command
  2024-08-30 21:06 ` Unpredictable peak memory usage when using `git log` command Jeff King
@ 2024-08-31 10:24   ` Yuri Karnilaev
  2024-09-02 13:08   ` Patrick Steinhardt
  1 sibling, 0 replies; 6+ messages in thread
From: Yuri Karnilaev @ 2024-08-31 10:24 UTC
  To: Jeff King; +Cc: git

Thanks, Peff!

I will try the recommendations you mentioned for reducing memory consumption in my task.

Have a nice day,
Yuri

* Re: Unpredictable peak memory usage when using `git log` command
  2024-08-30 21:06 ` Unpredictable peak memory usage when using `git log` command Jeff King
  2024-08-31 10:24   ` Yuri Karnilaev
@ 2024-09-02 13:08   ` Patrick Steinhardt
  1 sibling, 0 replies; 6+ messages in thread
From: Patrick Steinhardt @ 2024-09-02 13:08 UTC
  To: Jeff King; +Cc: Yuri Karnilaev, git

On Fri, Aug 30, 2024 at 05:06:07PM -0400, Jeff King wrote:
> On Fri, Aug 30, 2024 at 03:20:15PM +0300, Yuri Karnilaev wrote:
> 
> > 2. Processing commits in batches:
> > ```
> > /usr/bin/time -l -h -p git log --ignore-missing --pretty=format:%H%x02%P%x02%aN%x02%aE%x02%at%x00 -n 1000 --skip=1000000 --numstat > 1.txt
> > ```
> > [...]
> > Operating System: Mac OS 14.6.1 (23G93)
> > Git Version: 2.39.3 (Apple Git-146)
> 
> I sent a patch which I think should make things better for you, but I
> wanted to mention two things in a more general way:
> 
>   1. You should really consider building a commit-graph file with "git
>      commit-graph write --reachable". That will reduce the memory usage
>      for this case, but also improve the CPU quite a bit (we won't have
>      to open those million skipped commits to chase their parent
>      pointers).
> 
>      I haven't kept up with the defaults for writing graph files. I
>      thought gc.writeCommitGraph defaults to "true" these days, though
>      that wouldn't help in a freshly cloned repository (arguably we
>      should write the commit graph on clone?).

It does default to true indeed. There is also an option to write commit
graph on fetch via "fetch.writeCommitGraph", but that setting is set to
false by default. To the best of my knowledge there is no option to
generate on a clone, but I agree that it would be sensible to have such
a thing.
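
As a rough illustration of the settings discussed here, a freshly cloned repository could be prepared by hand along these lines. The helper name `enable_commit_graph` is hypothetical; the git commands themselves are the ones named in this thread.

```sh
# Hypothetical helper: build a commit-graph for the repository at $1
# (default: current directory) and keep it updated on later fetches.
enable_commit_graph() {
    repo=${1:-.}
    # Write a graph covering every commit reachable from any ref.
    git -C "$repo" commit-graph write --reachable
    # Reported above as false by default; enable it for this repository.
    git -C "$repo" config fetch.writeCommitGraph true
}
```

Calling `enable_commit_graph /path/to/linux` once after cloning should be enough for the traversal in this thread; later `git gc` runs would refresh the graph if gc.writeCommitGraph is left at its default.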

Patrick
