From: Joshua Redstone <joshua.redstone@fb.com>
To: "Nguyen Thai Ngoc Duy" <pclouds@gmail.com>,
	"Carlos Martín Nieto" <cmn@elego.de>,
	"Tomas Carnecky" <tom@dbservice.com>,
	"Junio C Hamano" <gitster@pobox.com>
Cc: "git@vger.kernel.org" <git@vger.kernel.org>
Subject: Re: Debugging git-commit slowness on a large repo
Date: Tue, 13 Dec 2011 00:15:30 +0000
Message-ID: <CB0BCE02.2CD42%joshua.redstone@fb.com>
In-Reply-To: <CB069308.2C9DD%joshua.redstone@fb.com>

Sorry for the poor formatting of the stack trace.

I've written two scripts to reproduce the slow commit behavior that I see.
I've posted both to:
   https://gist.github.com/1469760

To repro, first create a dir with lots of files (it defaults to creating 1
million files in 1000 dirs):

$ loadGen.py --baseDir=./bigdir

Then run the simulator script to generate and commit a series of small
changes to the repo:

$ git reset --hard HEAD && simulate.py ./bigdir git

The git reset cleans up any cruft left over from a previous partial
invocation of simulate.py.

Note that loadGen.py defaults to creating 1 million files and committing
them in one commit.  With a flash drive this took < 30 min, and subsequent
small commits in simulate.py took about 6 seconds.  With a hard-drive,
it's taking > 1hr (still waiting for it to finish).
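
For readers who cannot fetch the gist, here is a minimal sketch of the
kind of tree loadGen.py is described as building. It is not the script
from the gist: the function name generate_tree, the directory and file
naming, and the file contents are all assumptions made for illustration.
It also skips the "git add" and "git commit" that the real script
performs afterwards.

```python
import os


def generate_tree(base_dir, num_files=1_000_000, num_dirs=1000):
    """Create small files spread evenly across num_dirs subdirectories.

    Hypothetical stand-in for loadGen.py. Names and contents are
    assumptions. If num_files is not a multiple of num_dirs, the
    remainder is dropped. Returns the number of files written.
    """
    per_dir = num_files // num_dirs
    created = 0
    for d in range(num_dirs):
        dir_path = os.path.join(base_dir, f"dir{d:04d}")
        os.makedirs(dir_path, exist_ok=True)
        for f in range(per_dir):
            path = os.path.join(dir_path, f"file{f:06d}.txt")
            with open(path, "w") as fh:
                # Distinct contents per file so every blob is unique.
                fh.write(f"contents of file {f} in dir {d}\n")
            created += 1
    return created
```

Calling generate_tree("./bigdir") with the defaults would approximate
the 1-million-file, 1000-directory case. Passing much smaller counts
first is a cheap way to check the layout before waiting on the full run.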

Cheers,
Josh


On 12/8/11 4:17 PM, "Joshua Redstone" <joshua.redstone@fb.com> wrote:

>Btw, I also tried doing some very poor-man's profiling on "git commit"
>without any of the readtree/writetree/updateindex commands.
>
>Around 50% of the time was in (bottom few frames may have varied):
>
>#1  0x00000000004c467e in find_pack_entry (sha1=0x1475a44 , e=0x7fff2621f070) at sha1_file.c:2027
>#2  0x00000000004c57b0 in has_sha1_file (sha1=0x7fe2cd9c7900 "00") at sha1_file.c:2567
>#3  0x000000000046e4af in update_one (it=<value optimized out>, cache=<value optimized out>, entries=<value optimized out>, base=<value optimized out>, baselen=<value optimized out>, missing_ok=<value optimized out>, dryrun=0) at cache-tree.c:333
>#4  0x000000000046e278 in update_one (it=<value optimized out>, cache=<value optimized out>, entries=<value optimized out>, base=<value optimized out>, baselen=<value optimized out>, missing_ok=<value optimized out>, dryrun=0) at cache-tree.c:285
>#5  0x000000000046e278 in update_one (it=<value optimized out>, cache=<value optimized out>, entries=<value optimized out>, base=<value optimized out>, baselen=<value optimized out>, missing_ok=<value optimized out>, dryrun=0) at cache-tree.c:285
>#6  0x000000000046e278 in update_one (it=<value optimized out>, cache=<value optimized out>, entries=<value optimized out>, base=<value optimized out>, baselen=<value optimized out>, missing_ok=<value optimized out>, dryrun=0) at cache-tree.c:285
>#7  0x000000000046e278 in update_one (it=<value optimized out>, cache=<value optimized out>, entries=<value optimized out>, base=<value optimized out>, baselen=<value optimized out>, missing_ok=<value optimized out>, dryrun=0) at cache-tree.c:285
>#8  0x000000000046e278 in update_one (it=<value optimized out>, cache=<value optimized out>, entries=<value optimized out>, base=<value optimized out>, baselen=<value optimized out>, missing_ok=<value optimized out>, dryrun=0) at cache-tree.c:285
>#9  0x000000000046e869 in cache_tree_update (it=<value optimized out>, cache=<value optimized out>, entries=dwarf2_read_address: Corrupted DWARF expression.
>) at cache-tree.c:379
>#10 0x000000000041cade in prepare_to_commit (index_file=0x781740 ".git/index", prefix=<value optimized out>, current_head=<value optimized out>, s=0x7fff26220d00, author_ident=<value optimized out>) at builtin/commit.c:866
>#11 0x000000000041d891 in cmd_commit (argc=0, argv=0x7fff262213a0, prefix=0x0) at builtin/commit.c:1407
>#12 0x0000000000404bf7 in handle_internal_command (argc=4, argv=0x7fff262213a0) at git.c:308
>#13 0x0000000000404e2f in main (argc=4, argv=0x7fff262213a0) at git.c:512
>
>And 30% of the time was in:
>
>#0  0x00000034af2c34a5 in _lxstat () from /lib64/libc.so.6
>#1  0x00000000004abe0f in refresh_cache_ent (istate=0x780940, ce=0x7f8462a34e40, options=0, err=0x7fff6dd9f588) at /usr/include/sys/stat.h:443
>#2  0x00000000004ac1a0 in refresh_index (istate=0x780940, flags=<value optimized out>, pathspec=<value optimized out>, seen=<value optimized out>, header_msg=0x0) at read-cache.c:1133
>#3  0x000000000041b60a in refresh_cache_or_die (refresh_flags=<value optimized out>) at builtin/commit.c:331
>#4  0x000000000041bc39 in prepare_index (argc=0, argv=0x7fff6dda0310, prefix=0x0, current_head=<value optimized out>, is_status=<value optimized out>) at builtin/commit.c:414
>#5  0x000000000041d878 in cmd_commit (argc=0, argv=0x7fff6dda0310, prefix=0x0) at builtin/commit.c:1403
>
>
>Josh
>
>
>On 12/8/11 4:09 PM, "Joshua Redstone" <joshua.redstone@fb.com> wrote:
>
>>On 12/7/11 5:39 PM, "Nguyen Thai Ngoc Duy" <pclouds@gmail.com> wrote:
>>
>>>On Thu, Dec 8, 2011 at 5:48 AM, Joshua Redstone <joshua.redstone@fb.com>
>>>wrote:
>>>> Hi Duy,
>>>> Thanks for the documentation link.
>>>>
>>>> git ls-files shows 100k files, which matches # of files in the working
>>>> tree ('find . -type f -print | wc -l').
>>>
>>>Any chance you can split it into smaller repositories, or remove files
>>>from the working directory (e.g. if you store logs, you don't have to
>>>keep logs from all time in the working directory, they can be retrieved
>>>from history)?
>>
>>It's not really feasible to split it into smaller repositories.  In fact,
>>we're expecting it to grow between 3x and 5x in number of files and
>>number of commits.
>>
>>>
>>>> I added a 'git read-tree HEAD' before the git-add, and a 'git
>>>> write-tree' after the add.  With that, the commit time slowed down to
>>>> 8 seconds per commit, plus 4 more seconds for the
>>>> read-tree/add/write-tree ops.  The read-tree/add/write-tree each took
>>>> about a second.
>>>
>>>read-tree destroys stat info in index, refreshing 100k entries in
>>>index in this case may take some time. Try this to see if commit time
>>>reduces and how much time update-index takes
>>>
>>>read-tree HEAD
>>>update-index --refresh
>>>add ....
>>>write-tree
>>>commit -q
>>
>>I added the "update-index --refresh" and the time for commit became more
>>like 0.6 seconds.
>>In this setup: read-tree takes ~2 seconds, update-index takes ~8 seconds,
>>git-add takes 1 to 4 seconds, and write-tree takes less than 1 second.
>>
>>>
>>>> As an experiment, I also tried removing the 'git read-tree' and just
>>>> having the git-write-tree.  That sped up commits to 0.6 seconds, but
>>>> the overall time for add/write-tree/commit was still 3 to 6 seconds.
>>>
>>>overall time is not really important because we duplicate work here
>>>(write-tree is done as part of commit again). What I'm trying to do is
>>>to determine how much time each operation in commit may take.
>>>-- 
>>>Duy
>>
>
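
The per-step measurements discussed in the quoted thread (read-tree,
update-index --refresh, add, write-tree, commit -q) could be collected
with a small wrapper like the one below. This is only a sketch, not what
simulate.py does. The helper names time_steps and git_steps are
hypothetical, and the paths passed to "git add" and the commit message
are placeholders, since the thread elides them.

```python
import subprocess
import time


def time_steps(steps, cwd="."):
    """Run each (label, argv) in order.

    Returns a list of (label, seconds, returncode). A non-zero exit
    code is recorded rather than raised, so one failing step does not
    hide the timings of the others.
    """
    results = []
    for label, argv in steps:
        start = time.perf_counter()
        proc = subprocess.run(argv, cwd=cwd, stdout=subprocess.DEVNULL)
        results.append((label, time.perf_counter() - start, proc.returncode))
    return results


def git_steps(paths, message):
    """Build the command sequence suggested in the thread.

    paths and message are assumptions: the thread does not say which
    files were added or what commit message was used.
    """
    return [
        ("read-tree", ["git", "read-tree", "HEAD"]),
        ("update-index", ["git", "update-index", "--refresh"]),
        ("add", ["git", "add", "--"] + list(paths)),
        ("write-tree", ["git", "write-tree"]),
        ("commit", ["git", "commit", "-q", "-m", message]),
    ]
```

For example, time_steps(git_steps(["some/file"], "timing run"),
cwd="./bigdir") would return one tuple per step, which makes it easy to
see whether the cost sits in the index refresh or in the commit itself.
These are wall-clock times, so they will differ between a cold and a
warm disk cache.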
