git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jeff King <peff@peff.net>
To: Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org
Subject: [PATCH 0/2] diffcore-break optimizations
Date: Mon, 16 Nov 2009 10:53:32 -0500	[thread overview]
Message-ID: <20091116155331.GA30719@coredump.intra.peff.net> (raw)

On one of my more ridiculously gigantic repositories, I recently tried
to make a commit that ran git out of memory while trying to commit. The
repository has about 3 gigabytes of data, and I made a small-ish change
to every file. Pathological, yes, but I think we can do better than
chugging for 5 minutes and dying.

The culprit turned out to be memory usage in diffcore-break, which is on
by default for "git status" (and for the "git commit" template message).
It wants to have every changed blob in memory at once, which is just
silly.

The patches are:

  [1/2]: diffcore-break: free filespec data as we go

  This addresses the memory consumption issue. If you have enough
  memory, it doesn't actually yield a speed improvement, but nor does it
  show any slowdown for practical workloads.

  There is a theoretical slowdown when doing -B -M, because the rename
  phase has to re-fetch the blobs from the object store. However, I
  wasn't able to measure any slowdown for real-world cases (like "git
  log --summary -M -B >/dev/null" on git.git).

  I did manage to produce the slowdown on a pathological case: ten
  20-megabyte files, each copied with a slight modification to another
  file, and then replaced with totally different contents (so each one
  will be broken and then trigger an inexact rename). That diff went
  from 16s to 17s.

  But I improved that and more with the next optimization.

  [2/2]: diffcore-break: save cnt_data for other phases

  We already do this in rename detection, and since they use the same
  data format, there is little reason not to do so. My pathological case
  above went from 17s down to 12s. I wasn't able to detect any speedup
  or slowdown for sane cases.

  So I doubt anybody will even notice this, but I think since we can
  address pathological cases, we might as well (and as you will see, the
  code change is quite small).

All of that being said, I was able to do my commit, but I still had to
wait five minutes for it to chug through 3G of data. :) I am tempted to
add a "quick" mode to git-commit, but perhaps such a ridiculous case is
rare enough not to worry about. I worked around it by writing my commit
message separately and using "git commit -F".

-Peff

             reply	other threads:[~2009-11-16 15:53 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-11-16 15:53 Jeff King [this message]
2009-11-16 15:56 ` [PATCH 1/2] diffcore-break: free filespec data as we go Jeff King
2009-11-16 16:02 ` [PATCH 2/2] diffcore-break: save cnt_data for other phases Jeff King
2009-11-16 21:20   ` Junio C Hamano
2009-11-19 15:22     ` Jeff King
2009-11-20  7:32       ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20091116155331.GA30719@coredump.intra.peff.net \
    --to=peff@peff.net \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).