Re: Debugging git-commit slowness on a large repo

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Joshua Redstone <joshua.redstone@fb.com>
To: "Nguyen Thai Ngoc Duy" <pclouds@gmail.com>,
	"Carlos Martín Nieto" <cmn@elego.de>,
	"Tomas Carnecky" <tom@dbservice.com>,
	"Junio C Hamano" <gitster@pobox.com>
Cc: "git@vger.kernel.org" <git@vger.kernel.org>
Subject: Re: Debugging git-commit slowness on a large repo
Date: Tue, 20 Dec 2011 00:51:16 +0000	[thread overview]
Message-ID: <CB1518AB.2D649%joshua.redstone@fb.com> (raw)
In-Reply-To: <CB0BCE02.2CD42%joshua.redstone@fb.com>

I've managed to speed up git-commit on large repos by 4x by removing some
safeguards that caused git to stat every file in the repo on commits that
touch a small number of files.  The diff, for illustrative purposes only,
is at:

    https://gist.github.com/1499621


With a repo with 1 million files (but few commits), the diff drops the
commit time down from 7.3 seconds to 1.8 seconds, a 75% decrease. The
optimizations are:

1. Remove call to refresh_cache_or_die that stats every file in the repo,
i think the purpose is to detect any changes between git-add and
git-commit.

2. Pass missing_ok=true to cache_tree_update. This causes the tree
generation code to not stat every file in the repo to verify it still
exists as a git object.

3. Remove pair discard_cache/read_cache_from, which rereads the index
file. I think this was in case a pre-commit hook changed the set of things
being committed.

It may be worth making some of these flag-enabled.



Josh


On 12/12/11 4:15 PM, "Joshua Redstone" <joshua.redstone@fb.com> wrote:

>Sorry for the poor formatting of the stack trace.
>
>I've written two scripts to reproduce the slow commit behavior that I see.
> I've posted both to:
>   https://gist.github.com/1469760
>
>To repro, first create a dir with lots of files (it defaults to creating 1
>million files in 1000 dirs):
>
>$ loadGen.py --baseDir=./bigdir
>
>then, run the simulator scripts to generate and commit a series of small
>changes to the repo:
>
>$ git reset --hard HEAD && simulate.py ./bigdir git
>
>The git reset is to clean up any cruft left over from a previous partial
>invocation of simulate.py
>
>Note that loadGen.py defaults to creating 1 million files and committing
>them in one commit.  With a flash drive this took < 30 min, and subsequent
>small commits in simulate.py took about 6 seconds.  With a hard-drive,
>it's taking > 1hr (still waiting for it to finish).
>
>Cheers,
>Josh
>
>
>On 12/8/11 4:17 PM, "Joshua Redstone" <joshua.redstone@fb.com> wrote:
>
>>Btw, I also tried doing some very poor-man's profiling on "git commit"
>>without any of the readtree/writetree/updateindex commands.
>>
>>Around 50% of the time was in (bottom few frames may have varied)
>>
>>#1  0x00000000004c467e in find_pack_entry (sha1=0x1475a44 ,
>>e=0x7fff2621f070) at sha1_file.c:2027
>>#2  0x00000000004c57b0 in has_sha1_file (sha1=0x7fe2cd9c7900 "00") at
>>sha1_file.c:2567 
>>                 
>>                 
>>#3  0x000000000046e4af in update_one (it=<value optimized out>,
>>cache=<value optimized out>, entries=<value optimized out>, base=<value
>>optimized out>, baselen=<value optimized out>, missing_ok=<value
>>optimized
>>out>, dryrun=0) at cache-\
>>tree.c:333       
>>                 
>>                 
>>            
>>#4  0x000000000046e278 in update_one (it=<value optimized out>,
>>cache=<value optimized out>, entries=<value optimized out>, base=<value
>>optimized out>, baselen=<value optimized out>, missing_ok=<value
>>optimized
>>out>, dryrun=0) at cache-\
>>tree.c:285       
>>                 
>>                 
>>            
>>#5  0x000000000046e278 in update_one (it=<value optimized out>,
>>cache=<value optimized out>, entries=<value optimized out>, base=<value
>>optimized out>, baselen=<value optimized out>, missing_ok=<value
>>optimized
>>out>, dryrun=0) at cache-\
>>tree.c:285       
>>                 
>>                 
>>            
>>#6  0x000000000046e278 in update_one (it=<value optimized out>,
>>cache=<value optimized out>, entries=<value optimized out>, base=<value
>>optimized out>, baselen=<value optimized out>, missing_ok=<value
>>optimized
>>out>, dryrun=0) at cache-\
>>tree.c:285       
>>                 
>>                 
>>            
>>#7  0x000000000046e278 in update_one (it=<value optimized out>,
>>cache=<value optimized out>, entries=<value optimized out>, base=<value
>>optimized out>, baselen=<value optimized out>, missing_ok=<value
>>optimized
>>out>, dryrun=0) at cache-\
>>tree.c:285       
>>                 
>>                 
>>            
>>#8  0x000000000046e278 in update_one (it=<value optimized out>,
>>cache=<value optimized out>, entries=<value optimized out>, base=<value
>>optimized out>, baselen=<value optimized out>, missing_ok=<value
>>optimized
>>out>, dryrun=0) at cache-\
>>tree.c:285       
>>                 
>>                 
>>            
>>#9  0x000000000046e869 in cache_tree_update (it=<value optimized out>,
>>cache=<value optimized out>, entries=dwarf2_read_address: Corrupted DWARF
>>expression.      
>>                 
>>) at cache-tree.c:379
>>                 
>>                 
>>            
>>#10 0x000000000041cade in prepare_to_commit (index_file=0x781740
>>".git/index", prefix=<value optimized out>, current_head=<value optimized
>>out>, s=0x7fff26220d00, author_ident=<value optimized out>) at
>>builtin/commit.c:866
>>#11 0x000000000041d891 in cmd_commit (argc=0, argv=0x7fff262213a0,
>>prefix=0x0) at builtin/commit.c:1407
>>                 
>>                 
>>#12 0x0000000000404bf7 in handle_internal_command (argc=4,
>>argv=0x7fff262213a0) at git.c:308
>>                 
>>                 
>>#13 0x0000000000404e2f in main (argc=4, argv=0x7fff262213a0) at git.c:512
>>                 
>>                 
>>            
>> 
>>
>>
>>And 30% of the time was in:
>>
>>#0  0x00000034af2c34a5 in _lxstat () from /lib64/libc.so.6
>>                 
>>                 
>>            
>>#1  0x00000000004abe0f in refresh_cache_ent (istate=0x780940,
>>ce=0x7f8462a34e40, options=0, err=0x7fff6dd9f588) at
>>/usr/include/sys/stat.h:443
>>                 
>>#2  0x00000000004ac1a0 in refresh_index (istate=0x780940, flags=<value
>>optimized out>, pathspec=<value optimized out>, seen=<value optimized
>>out>, header_msg=0x0) at read-cache.c:1133
>>                 
>>#3  0x000000000041b60a in refresh_cache_or_die (refresh_flags=<value
>>optimized out>) at builtin/commit.c:331
>>                 
>>                 
>>#4  0x000000000041bc39 in prepare_index (argc=0, argv=0x7fff6dda0310,
>>prefix=0x0, current_head=<value optimized out>, is_status=<value
>>optimized
>>out>) at builtin/commit.c:414
>>                 
>>#5  0x000000000041d878 in cmd_commit (argc=0, argv=0x7fff6dda0310,
>>prefix=0x0) at builtin/commit.c:1403
>>                 
>>                 
>>  
>>
>>
>>Josh
>>
>>
>>On 12/8/11 4:09 PM, "Joshua Redstone" <joshua.redstone@fb.com> wrote:
>>
>>>On 12/7/11 5:39 PM, "Nguyen Thai Ngoc Duy" <pclouds@gmail.com> wrote:
>>>
>>>>On Thu, Dec 8, 2011 at 5:48 AM, Joshua Redstone
>>>><joshua.redstone@fb.com>
>>>>wrote:
>>>>> Hi Duy,
>>>>> Thanks for the documentation link.
>>>>>
>>>>> git ls-files shows 100k files, which matches # of files in the
>>>>>working
>>>>> tree ('find . -type f -print | wc -l').
>>>>
>>>>Any chance you can split it into smaller repositories, or remove files
>>>>from working directory (e.g. if you store logs, you don't have to keep
>>>>logs from all time in working directory, they can be retrieved from
>>>>history).
>>>
>>>It's not really feasible to split it into smaller repositories.  In
>>>fact,
>>>we're expecting it to grow between 3x and 5x in number of files and
>>>number
>>>of commits.
>>>
>>>>
>>>>> I added a 'git read-tree HEAD' before the git-add, and a 'git
>>>>>write-tree'
>>>>> after the add.  With that, the commit time slowed down to 8 seconds
>>>>>per
>>>>> commit, plus 4 more seconds for the read-tree/add/write-tree ops.
>>>>>The
>>>>> read-tree/add/write-tree each took about a second.
>>>>
>>>>read-tree destroys stat info in index, refreshing 100k entries in
>>>>index in this case may take some time. Try this to see if commit time
>>>>reduces and how much time update-index takes
>>>>
>>>>read-tree HEAD
>>>>update-index --refresh
>>>>add ....
>>>>write-tree
>>>>commit -q
>>>
>>>I added the "update-index --refresh" and the time for commit became more
>>>like 0.6 seconds.
>>>In this setup: read-tree takes ~2 seconds, update-index takes ~8
>>>seconds,
>>>git-add takes 1 to 4 seconds, and write-tree takes less than 1 second.
>>>
>>>>
>>>>> As an experiment, I also tried removing the 'git read-tree' and just
>>>>> having the git-write-tree.  That sped up commits to 0.6 seconds, but
>>>>>the
>>>>> overall time for add/write-tree/commit was still 3 to 6 seconds.
>>>>
>>>>overall time is not really important because we duplicate work here
>>>>(write-tree is done as part of commit again). What I'm trying to do is
>>>>to determine how much time each operation in commit may take.
>>>>-- 
>>>>Duy
>>>
>>
>

next prev parent reply	other threads:[~2011-12-20  0:52 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-12-02 23:17 Debugging git-commit slowness on a large repo Joshua Redstone
2011-12-03  0:23 ` Carlos Martín Nieto
2011-12-05 17:38   ` Junio C Hamano
2011-12-07  1:48   ` Joshua Redstone
2011-12-07  2:08     ` Nguyen Thai Ngoc Duy
2011-12-07 22:48       ` Joshua Redstone
2011-12-08  1:39         ` Nguyen Thai Ngoc Duy
2011-12-09  0:09           ` Joshua Redstone
2011-12-09  0:17             ` Joshua Redstone
2011-12-13  0:15               ` Joshua Redstone
2011-12-20  0:51                 ` Joshua Redstone [this message]
2011-12-20  1:21                   ` Junio C Hamano
2011-12-20  1:40                     ` Joshua Redstone
2011-12-20  9:23                       ` Thomas Rast
2011-12-20 19:26                         ` Joshua Redstone
2011-12-04 13:54 ` Tomas Carnecky

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CB1518AB.2D649%joshua.redstone@fb.com \
    --to=joshua.redstone@fb.com \
    --cc=cmn@elego.de \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=pclouds@gmail.com \
    --cc=tom@dbservice.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.