From: Jeff King <peff@peff.net>
To: git@vger.kernel.org
Cc: Junio C Hamano <junkio@cox.net>, spearce@spearce.org
Subject: [PATCH 0/3] fast-import slow on large directories
Date: Sat, 10 Mar 2007 14:15:15 -0500
Message-ID: <20070310191515.GA3416@coredump.intra.peff.net>

The short story:

I wanted to import into git a dataset consisting of a single directory
with 300,000 files. I tried using git-fast-import, but it wasn't able to
handle the large directory size. This patchset optimizes the algorithms
used for tree handling, and I get orders of magnitude improvements in
memory and CPU consumption.

The patches are (see commit messages for more explanation):

  1. grow tree storage more aggressively
  2. code rearranging to make patch 3 easier to read
  3. keep tree entries sorted and use binary instead of linear searches

The long story, with numbers:

Originally I just tried git-fast-import from 'next'. It built the pack
file (about 65M) from the blobs after a few minutes, and then while
building the commit, consumed all system memory (about 1G) and crashed.
The culprit was that tree storage grew by only a constant amount each
time it filled up, coupled with failure to pass allocated pool memory
back to the OS. Patch 1 instead doubles the allocated size each time we
run out of space.
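
To illustrate why the growth policy matters, here is a minimal sketch
(not the code from the patch). It counts the total number of entry slots
allocated while appending n entries one at a time, under the simplifying
assumption that every outgrown block stays allocated. The function names
and the step size of 8 are hypothetical, chosen only for illustration.

```c
#include <stddef.h>

/* Hypothetical growth policies; names and the step of 8 are illustrative. */
static size_t grow_constant(size_t cap) { return cap + 8; }
static size_t grow_double(size_t cap)   { return cap ? cap * 2 : 8; }

/*
 * Total slots ever allocated while appending n entries one at a time,
 * assuming outgrown blocks are never reused or handed back.
 */
static size_t total_slots_allocated(size_t n, size_t (*grow)(size_t))
{
	size_t cap = 0, total = 0, used;

	for (used = 0; used < n; used++) {
		if (used == cap) {
			/* Out of room: allocate a larger block. */
			cap = grow(cap);
			total += cap;
		}
	}
	return total;
}
```

Under these assumptions, 20,000 appends allocate 25,010,000 slots in
total with the constant policy and 65,528 with doubling: quadratic
versus linear in the number of entries.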

With patch 1, the memory usage was much more reasonable (it ends up
using about 46M). However, the process still ran for over an hour before
I killed it (bear in mind that doing deltas on all of the blobs takes
about 5 minutes). The culprit this time was the linear search through
the tree entries looking to see if each 'M' line was a new entry or an
update. Patch 3 turns this into a binary search.
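
The lookup can be sketched as follows. This is an illustrative example
rather than the patch itself: it assumes the entries are kept in an
array of names sorted in plain strcmp() order (real git tree ordering
treats directories specially), and find_entry is a hypothetical name.

```c
#include <string.h>

/*
 * Binary search over a name-sorted array. Returns the index of `name`
 * if present; otherwise returns -(insertion point) - 1, so the caller
 * knows where a new entry would have to go to keep the array sorted.
 */
static int find_entry(const char **names, int nr, const char *name)
{
	int lo = 0, hi = nr;

	while (lo < hi) {
		int mid = lo + (hi - lo) / 2;
		int cmp = strcmp(name, names[mid]);

		if (!cmp)
			return mid;	/* existing entry: an update */
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return -lo - 1;		/* not found: a new entry */
}
```

Each lookup takes O(log n) comparisons instead of O(n), so checking n
'M' lines costs O(n log n) comparisons rather than O(n^2).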

To do some testing, I cut my original dataset down to 20,000 entries,
which is small enough for the stock git-fast-import to finish. Here are
the numbers:

For reference, just adding the blobs using stock git-fast-import without
making a commit (the memory report is the "Memory total" from gfi):
mem: 2673 KiB
5.86user 3.67system 0:09.57elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+405366minor)pagefaults 0swaps

Now here's stock git-fast-import making the commit (note the memory):
mem: 101992 KiB
37.07user 4.15system 0:41.55elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+430469minor)pagefaults 0swaps

Now here's with just patch 1 (better memory, but still slow):
mem: 3688 KiB
30.00user 3.73system 0:34.80elapsed 96%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+406064minor)pagefaults 0swaps

And with patches 1, 2, and 3:
mem: 3688 KiB
6.08user 3.71system 0:10.10elapsed 96%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+406064minor)pagefaults 0swaps

And my final 300,000 item dataset with patches 1, 2, and 3:
mem: 46378 KiB
414.17user 69.82system 8:11.92elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+7730960minor)pagefaults 0swaps


Yes, this dataset is pathological. But I suspect the speed improvements
will help even modest projects a little, and almost certainly not hurt
(though the aggressive memory growth will probably waste a bit more
memory).

-Peff
