* Unpredictable peak memory usage when using `git log` command
@ 2024-08-30 12:20 Yuri Karnilaev
2024-08-30 20:53 ` [PATCH] revision: free commit buffers for skipped commits Jeff King
2024-08-30 21:06 ` Unpredictable peak memory usage when using `git log` command Jeff King
0 siblings, 2 replies; 6+ messages in thread
From: Yuri Karnilaev @ 2024-08-30 12:20 UTC (permalink / raw)
To: git
Hello,
I encountered an issue when using the `git log` command to retrieve commits in large repositories. My task is to iterate over all commits and output them in a specific format. However, my computer has limited memory, so I am looking for a way to reduce the memory consumption of this operation.
I tested two different commands on the `torvalds/linux` repository as an example of a large repository and noticed a significant difference in peak memory usage:
1. Processing all commits in one go:
```
/usr/bin/time -l -h -p git log --ignore-missing --pretty=format:%H%x02%P%x02%aN%x02%aE%x02%at%x00 --numstat > 1.txt
```
Result:
```
real 594,01
user 562,22
sys 12,43
7407976448 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
187437 page reclaims
274228 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
1031 voluntary context switches
287056 involuntary context switches
5455479398547 instructions retired
1828253079874 cycles elapsed
135_616_064 peak memory footprint
```
2. Processing commits in batches:
```
/usr/bin/time -l -h -p git log --ignore-missing --pretty=format:%H%x02%P%x02%aN%x02%aE%x02%at%x00 -n 1000 --skip=1000000 --numstat > 1.txt
```
Result:
```
real 9,83
user 7,48
sys 0,40
2390540288 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
93487 page reclaims
52995 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
634 voluntary context switches
14183 involuntary context switches
50173495540 instructions retired
24906960156 cycles elapsed
1_470_935_680 peak memory footprint
```
As you can see from the results, the peak memory usage when processing commits in batches is 10 times higher than when processing all commits in one go.
Can you please explain why this happens? Is there a way to work around this? Or maybe can you fix this in future Git versions?
Operating System: Mac OS 14.6.1 (23G93)
Git Version: 2.39.3 (Apple Git-146)
Best regards,
Yuri
* [PATCH] revision: free commit buffers for skipped commits
2024-08-30 12:20 Unpredictable peak memory usage when using `git log` command Yuri Karnilaev
@ 2024-08-30 20:53 ` Jeff King
2024-08-30 21:27 ` Junio C Hamano
2024-08-30 21:06 ` Unpredictable peak memory usage when using `git log` command Jeff King
1 sibling, 1 reply; 6+ messages in thread
From: Jeff King @ 2024-08-30 20:53 UTC (permalink / raw)
To: Yuri Karnilaev; +Cc: git

On Fri, Aug 30, 2024 at 03:20:15PM +0300, Yuri Karnilaev wrote:

> As you can see from the results, the peak memory usage when processing
> commits in batches is 10 times higher than when processing all commits
> in one go.
> Can you please explain why this happens? Is there a way to work around
> this? Or maybe can you fix this in future Git versions?

Try this:

-- >8 --
Subject: [PATCH] revision: free commit buffers for skipped commits

In git-log we leave the save_commit_buffer flag set to "1", which tells the commit parsing code to store the object content after it has parsed it to find parents, tree, etc. That lets us reuse the contents for pretty-printing the commit in the output. And then after printing each commit, we call free_commit_buffer(), since we don't need it anymore.

But some options may cause us to traverse commits which are not part of the output. And so git-log does not see them at all, and doesn't free them. One such case is something like:

  git log -n 1000 --skip=1000000

which will churn through a million commits, before showing only a thousand. We loop through these inside get_revision(), without freeing the contents. As a result, we end up storing the object data for those million commits simultaneously.

We should free the stored buffers (if any) for those commits as we skip over them, which is what this patch does. Running the above command in linux.git drops the peak heap usage from ~1.1GB to ~200MB, according to valgrind/massif. (I thought we might get an even bigger improvement, but the remaining memory is going to commit/tree structs, which we do hold on to forever).

Note that this problem doesn't occur if:

  - you're running a git-rev-list without a --format parameter; it turns off save_commit_buffer by default, since it only outputs the object id

  - you've built a commit-graph file, since in that case we'd use the optimized graph data instead of the initial parse, and then do a lazy parse for commits we're actually going to output

There are probably some other option combinations that can likewise end up with useless stored commit buffers. For example, if you ask for "foo..bar", then we'll have to walk down to the merge base, and everything on the "foo" side won't be shown. Tuning the "save" behavior to handle that might be tricky (I guess maybe drop buffers for anything we mark as UNINTERESTING?). And in the long run, the right solution here is probably to make sure the commit-graph is built (since it fixes the memory problem _and_ drastically reduces CPU usage).

But since this "--skip" case is an easy one-liner, it's worth fixing in the meantime. It should be OK to make this call even if there is no saved buffer (e.g., because save_commit_buffer=0, or because a commit-graph was used), since it's O(1) to look up the buffer and is a noop if it isn't present. I verified by running the above command after "git commit-graph write --reachable", and it takes the same time with and without this patch.

Reported-by: Yuri Karnilaev <karnilaev@gmail.com>
Signed-off-by: Jeff King <peff@peff.net>
---
 revision.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/revision.c b/revision.c
index ac94f8d429..2d7ad2bddf 100644
--- a/revision.c
+++ b/revision.c
@@ -4407,6 +4407,7 @@ static struct commit *get_revision_internal(struct rev_info *revs)
 			c = get_revision_1(revs);
 			if (!c)
 				break;
+			free_commit_buffer(revs->repo->parsed_objects, c);
 		}
 	}

--
2.46.0.769.g1b22d789e3
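The retention problem described in the commit message can be illustrated with a toy model. The Python sketch below is not Git's code: `peak_buffers` and its parameters are hypothetical names, and it only counts how many commit buffers are alive at once when a walk skips some commits before printing the rest, under the assumption that every parsed commit stores exactly one buffer.

```python
def peak_buffers(walked, skip, free_skipped):
    """Peak number of commit buffers held at once in a toy model of --skip.

    walked:       total commits traversed (skipped plus shown)
    skip:         how many leading commits are skipped (never printed)
    free_skipped: True models the patched loop, False the old behavior
    """
    live = 0
    peak = 0
    for index in range(walked):
        live += 1  # parsing stores the buffer (save_commit_buffer is on)
        peak = max(peak, live)
        if index >= skip:
            live -= 1  # shown commit: the buffer is freed after printing
        elif free_skipped:
            live -= 1  # skipped commit: the patch frees the buffer right away
        # Otherwise the skipped commit's buffer stays allocated until exit.
    return peak
```

In this model, skipping 1,000,000 commits and showing 1,000 gives a peak of 1,000,001 live buffers without the fix and 1 with it. The real numbers differ because, as noted above, the commit and tree structs are kept either way.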
* Re: [PATCH] revision: free commit buffers for skipped commits
2024-08-30 20:53 ` [PATCH] revision: free commit buffers for skipped commits Jeff King
@ 2024-08-30 21:27 ` Junio C Hamano
0 siblings, 0 replies; 6+ messages in thread
From: Junio C Hamano @ 2024-08-30 21:27 UTC (permalink / raw)
To: Jeff King; +Cc: Yuri Karnilaev, git

Jeff King <peff@peff.net> writes:

> But since this "--skip" case is an easy one-liner, it's worth fixing in
> the meantime.

OK.

> diff --git a/revision.c b/revision.c
> index ac94f8d429..2d7ad2bddf 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -4407,6 +4407,7 @@ static struct commit *get_revision_internal(struct rev_info *revs)
>  			c = get_revision_1(revs);
>  			if (!c)
>  				break;
> +			free_commit_buffer(revs->repo->parsed_objects, c);
>  		}

Even if we freed the buffer and then later need it, we'd read the buffer again anyway, so this is a safe thing to do. And because commits skipped in this separate loop will _never_ be given to the caller of get_revision(), this is a reasonable optimization, too.

Will queue. Thanks.
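The safety argument in this reply (a freed buffer is simply read again on demand, and freeing a missing buffer does nothing) can be sketched as a small cache. This is a minimal illustration, not Git's implementation: `CommitBufferCache` is a hypothetical class, and `read_object` is a placeholder for whatever loads the commit content from the object database.

```python
class CommitBufferCache:
    """Toy stand-in for saved commit buffers that are re-read lazily."""

    def __init__(self, read_object):
        self._read_object = read_object  # assumed loader: oid -> bytes
        self._buffers = {}
        self.reads = 0  # how many times the loader was actually called

    def get(self, oid):
        # A missing buffer is loaded again, so an earlier free() is harmless.
        if oid not in self._buffers:
            self.reads += 1
            self._buffers[oid] = self._read_object(oid)
        return self._buffers[oid]

    def free(self, oid):
        # Constant-time lookup; does nothing when no buffer is stored.
        self._buffers.pop(oid, None)
```

Freeing a buffer that is later needed costs one extra read but never changes the content returned, which is why the only question left is whether the free is worth doing; for commits that are never handed to the caller, it is.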
* Re: Unpredictable peak memory usage when using `git log` command
2024-08-30 12:20 Unpredictable peak memory usage when using `git log` command Yuri Karnilaev
2024-08-30 20:53 ` [PATCH] revision: free commit buffers for skipped commits Jeff King
@ 2024-08-30 21:06 ` Jeff King
2024-08-31 10:24 ` Yuri Karnilaev
2024-09-02 13:08 ` Patrick Steinhardt
1 sibling, 2 replies; 6+ messages in thread
From: Jeff King @ 2024-08-30 21:06 UTC (permalink / raw)
To: Yuri Karnilaev; +Cc: git

On Fri, Aug 30, 2024 at 03:20:15PM +0300, Yuri Karnilaev wrote:

> 2. Processing commits in batches:
> ```
> /usr/bin/time -l -h -p git log --ignore-missing --pretty=format:%H%x02%P%x02%aN%x02%aE%x02%at%x00 -n 1000 --skip=1000000 --numstat > 1.txt
> ```
> [...]
> Operating System: Mac OS 14.6.1 (23G93)
> Git Version: 2.39.3 (Apple Git-146)

I sent a patch which I think should make things better for you, but I wanted to mention two things in a more general way:

  1. You should really consider building a commit-graph file with "git commit-graph write --reachable". That will reduce the memory usage for this case, but also improve the CPU quite a bit (we won't have to open those million skipped commits to chase their parent pointers).

     I haven't kept up with the defaults for writing graph files. I thought gc.writeCommitGraph defaults to "true" these days, though that wouldn't help in a freshly cloned repository (arguably we should write the commit graph on clone?).

  2. Using "--skip" still has to traverse all of those intermediate commits. So it's effectively quadratic in the number of commits overall (you end up skipping the first 1000 over and over).

     It's been a while since I've had to "paginate" segments of history like this, but a better solution is along the lines of:

       - use "-n 1000" to get 1000 commits in each chunk

       - use "--boundary" to report the commits that were queued to be traversed next but weren't shown

       - in invocations after the first one, start the traversal at those boundary commits, rather than HEAD

     You'll probably need to add "%m" to your format to show the boundaries (or alternatively, you can do the commit selection with rev-list, and then output the result to "log --no-walk --stdin" to do the pretty-printing).

-Peff
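The two points in item 2 can be sketched in Python. This is a rough illustration under stated assumptions, not a tested tool: `commits_walked`, `split_page`, `run_git`, and `paginate` are hypothetical helper names. The sketch assumes that `git rev-list --boundary` marks boundary commits with a leading `-`, and it carries over any starting commit that a page did not reach, since such commits would otherwise be lost. In histories with clock skew a commit might still appear on more than one page.

```python
import subprocess


def commits_walked(total, page_size, use_skip):
    """Rough count of commits traversed to page through `total` commits."""
    if not use_skip:
        return total  # each page resumes where the previous one stopped
    pages = -(-total // page_size)  # ceiling division
    # Page k re-walks the k * page_size commits it skips, then its own page.
    return sum(min(total, (k + 1) * page_size) for k in range(pages))


def split_page(lines, tips):
    """Split `git rev-list --boundary` output into shown commits and next tips."""
    shown = [line for line in lines if line and not line.startswith("-")]
    boundary = [line[1:] for line in lines if line.startswith("-")]
    shown_set = set(shown)
    # Starting points this page never reached must be carried over as well.
    unreached = [tip for tip in tips if tip not in shown_set]
    return shown, list(dict.fromkeys(boundary + unreached))


def run_git(repo, args, stdin=None):
    """Run git in `repo` and return its standard output as text."""
    result = subprocess.run(
        ["git", "-C", repo, *args],
        input=stdin, capture_output=True, text=True, check=True,
    )
    return result.stdout


def paginate(repo, start="HEAD", page_size=1000):
    """Yield lists of commit ids, one page at a time, without --skip."""
    tips = [run_git(repo, ["rev-parse", "--verify", f"{start}^{{commit}}"]).strip()]
    while tips:
        out = run_git(
            repo,
            ["rev-list", f"--max-count={page_size}", "--boundary", "--stdin"],
            stdin="\n".join(tips) + "\n",
        )
        shown, tips = split_page(out.splitlines(), tips)
        if not shown:
            break  # defensive stop so the loop cannot spin forever
        yield shown
```

For 1,000,000 commits in pages of 1,000, `commits_walked` gives 500,500,000 traversed commits with `--skip` against 1,000,000 when each page resumes from the previous frontier. Each page from `paginate` could then be fed to "log --no-walk --stdin" for pretty-printing, as suggested above.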
* Re: Unpredictable peak memory usage when using `git log` command
2024-08-30 21:06 ` Unpredictable peak memory usage when using `git log` command Jeff King
@ 2024-08-31 10:24 ` Yuri Karnilaev
2024-09-02 13:08 ` Patrick Steinhardt
1 sibling, 0 replies; 6+ messages in thread
From: Yuri Karnilaev @ 2024-08-31 10:24 UTC (permalink / raw)
To: Jeff King; +Cc: git

Thanks, Peff!
I will try the recommendations for optimizing memory consumption for my task, that you mentioned.

Have a nice day,
Yuri

> On 31. Aug 2024, at 0.06, Jeff King <peff@peff.net> wrote:
>
> On Fri, Aug 30, 2024 at 03:20:15PM +0300, Yuri Karnilaev wrote:
>
>> 2. Processing commits in batches:
>> ```
>> /usr/bin/time -l -h -p git log --ignore-missing --pretty=format:%H%x02%P%x02%aN%x02%aE%x02%at%x00 -n 1000 --skip=1000000 --numstat > 1.txt
>> ```
>> [...]
>> Operating System: Mac OS 14.6.1 (23G93)
>> Git Version: 2.39.3 (Apple Git-146)
>
> I sent a patch which I think should make things better for you, but I
> wanted to mention two things in a more general way:
>
>   1. You should really consider building a commit-graph file with "git
>      commit-graph write --reachable". That will reduce the memory usage
>      for this case, but also improve the CPU quite a bit (we won't have
>      to open those million skipped commits to chase their parent
>      pointers).
>
>      I haven't kept up with the defaults for writing graph files. I
>      thought gc.writeCommitGraph defaults to "true" these days, though
>      that wouldn't help in a freshly cloned repository (arguably we
>      should write the commit graph on clone?).
>
>   2. Using "--skip" still has to traverse all of those intermediate
>      commits. So it's effectively quadratic in the number of commits
>      overall (you end up skipping the first 1000 over and over).
>
>      It's been a while since I've had to "paginate" segments of history
>      like this, but a better solution is along the lines of:
>
>        - use "-n 1000" to get 1000 commits in each chunk
>
>        - use "--boundary" to report the commits that were queued to be
>          traversed next but weren't shown
>
>        - in invocations after the first one, start the traversal at
>          those boundary commits, rather than HEAD
>
>      You'll probably need to add "%m" to your format to show the
>      boundaries (or alternatively, you can do the commit selection with
>      rev-list, and then output the result to "log --no-walk --stdin" to
>      do the pretty-printing).
>
> -Peff
* Re: Unpredictable peak memory usage when using `git log` command
2024-08-30 21:06 ` Unpredictable peak memory usage when using `git log` command Jeff King
2024-08-31 10:24 ` Yuri Karnilaev
@ 2024-09-02 13:08 ` Patrick Steinhardt
1 sibling, 0 replies; 6+ messages in thread
From: Patrick Steinhardt @ 2024-09-02 13:08 UTC (permalink / raw)
To: Jeff King; +Cc: Yuri Karnilaev, git

On Fri, Aug 30, 2024 at 05:06:07PM -0400, Jeff King wrote:
> On Fri, Aug 30, 2024 at 03:20:15PM +0300, Yuri Karnilaev wrote:
>
> > 2. Processing commits in batches:
> > ```
> > /usr/bin/time -l -h -p git log --ignore-missing --pretty=format:%H%x02%P%x02%aN%x02%aE%x02%at%x00 -n 1000 --skip=1000000 --numstat > 1.txt
> > ```
> > [...]
> > Operating System: Mac OS 14.6.1 (23G93)
> > Git Version: 2.39.3 (Apple Git-146)
>
> I sent a patch which I think should make things better for you, but I
> wanted to mention two things in a more general way:
>
>   1. You should really consider building a commit-graph file with "git
>      commit-graph write --reachable". That will reduce the memory usage
>      for this case, but also improve the CPU quite a bit (we won't have
>      to open those million skipped commits to chase their parent
>      pointers).
>
>      I haven't kept up with the defaults for writing graph files. I
>      thought gc.writeCommitGraph defaults to "true" these days, though
>      that wouldn't help in a freshly cloned repository (arguably we
>      should write the commit graph on clone?).

It does default to true indeed. There is also an option to write commit graph on fetch via "fetch.writeCommitGraph", but that setting is set to false by default. To the best of my knowledge there is no option to generate on a clone, but I agree that it would be sensible to have such a thing.

Patrick
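For a freshly cloned repository, the advice in this subthread amounts to writing the commit-graph once by hand and, optionally, turning on the fetch-time setting mentioned above. A minimal Python sketch follows; `commit_graph_commands` and `enable_commit_graph` are hypothetical helper names, and the only Git pieces used are "git commit-graph write --reachable" and the "fetch.writeCommitGraph" setting named in the thread.

```python
import subprocess


def commit_graph_commands(repo):
    """Return the git commands that build and keep a commit-graph for `repo`."""
    return [
        # One-off: write a commit-graph covering every reachable commit.
        ["git", "-C", repo, "commit-graph", "write", "--reachable"],
        # Optional: refresh the graph on fetch (off by default per the thread).
        ["git", "-C", repo, "config", "fetch.writeCommitGraph", "true"],
    ]


def enable_commit_graph(repo):
    """Run the commands above, raising if any of them fails."""
    for command in commit_graph_commands(repo):
        subprocess.run(command, check=True)
```

Calling `enable_commit_graph("/path/to/linux")` before the batch export would be one way to apply it; the path is only a placeholder.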