* Debugging git-commit slowness on a large repo
From: Joshua Redstone @ 2011-12-02 23:17 UTC
To: git@vger.kernel.org

Hi,

I have a git repo with about 300k commits, 150k files totaling maybe 7GB.
Locally committing a small change - say touching fewer than 300 bytes
across 4 files - consistently takes over one second, which seems kinda
slow. This is using git 1.7.7.4 on a linux 2.6 box. The time does not
improve after doing a git-gc (my .git dir has maybe 250 files after a git
gc). The same size commit on a brand new repo takes < 10ms. Any thoughts
on why committing a small change seems to take a long time on larger repos?

Fwiw, I also tried doing the same test using libgit2 (via the pygit2
wrapper), and it was even slower (about 6 seconds to commit the same small
change).

Thanks for any thoughts or places to look.

Cheers,
Josh

* Re: Debugging git-commit slowness on a large repo
From: Carlos Martín Nieto @ 2011-12-03 0:23 UTC
To: Joshua Redstone
Cc: git@vger.kernel.org

On Fri, Dec 02, 2011 at 11:17:10PM +0000, Joshua Redstone wrote:
> Hi,
> I have a git repo with about 300k commits, 150k files totaling maybe 7GB.
> Locally committing a small change - say touching fewer than 300 bytes
> across 4 files - consistently takes over one second, which seems kinda
> slow. This is using git 1.7.7.4 on a linux 2.6 box. The time does not
> improve after doing a git-gc (my .git dir has maybe 250 files after a git
> gc). The same size commit on a brand new repo takes < 10ms. Any thoughts
> on why committing a small change seems to take a long time on larger repos?

By "same size commit" do you mean the same amount of changes, or the
same number of files? Committing doesn't depend on the size of the
repo (by itself), but on the size of the index, which depends on the
number of files to be committed (as git is snapshot-based). At one
point, commit forgot how to write the tree cache to the index (a
performance optimisation). Do the times improve if you run 'git
read-tree HEAD' between one commit and another? Note that this will
reset the index to the last commit, though for the tests I imagine you
use some variation of 'git commit -a'.

Thomas Rast wrote a patch to re-teach commit to store the tree cache,
but there were some issues and it never got applied.

> Fwiw, I also tried doing the same test using libgit2 (via the pygit2
> wrapper), and it was even slower (about 6 seconds to commit the same small
> change).

I don't know about the python bindings, but in the (somewhat
unscientific) tests for libgit2's write-tree (the slow part of
creating a commit), it performs slightly faster than git's (though I
think git's write-tree does update the tree cache, which libgit2
doesn't currently). The speed could just be a side-effect of the small
test repo. From your domain, I assume the data is not for public
consumption, but it'd be great if you could post your code to pygit2's
issue tracker so we can see how much of the slowdown comes from the
bindings and how much from the library.

   cmn

* Re: Debugging git-commit slowness on a large repo
From: Junio C Hamano @ 2011-12-05 17:38 UTC
To: Carlos Martín Nieto, Thomas Rast
Cc: Joshua Redstone, git@vger.kernel.org

Carlos Martín Nieto <cmn@elego.de> writes:

> ... At one
> point, commit forgot how to write the tree cache to the index (a
> performance optimisation). Do the times improve if you run 'git
> read-tree HEAD' between one commit and another? Note that this will
> reset the index to the last commit, though for the tests I imagine you
> use some variation of 'git commit -a'.
>
> Thomas Rast wrote a patch to re-teach commit to store the tree cache,
> but there were some issues and never got applied.

Ahh, I forgot all about that exchange.

  http://thread.gmane.org/gmane.comp.version-control.git/178480/focus=178515

The cache-tree mechanism has traditionally been one of the more important
optimizations and it would be very nice if we can resurrect the behaviour
for "git commit" too.

* Re: Debugging git-commit slowness on a large repo
From: Joshua Redstone @ 2011-12-07 1:48 UTC
To: Carlos Martín Nieto, Tomas Carnecky, Junio C Hamano
Cc: git@vger.kernel.org

Hi Carlos and Tomas and Junio,

@Tomas, I tried adding the '--no-status' flag to 'git commit' and it sped
things up by maybe 15%, but commits still take a second.

@Carlos, by "same size", I mean roughly the same number of files and
number of bytes modified in each file. In all experiments, it's less than
5 files modified per commit with changes totaling fewer than 10 KB, often
more like 1 KB.

I actually wrote a test script to generate commits, customized for the
stats on the repo I'm using. It repeatedly generates some changes, does
'git add [ list of files changed ]' followed by 'git commit --no-status -m
[ msg ]'. It generates changes by picking fewer than 5 files at random,
modifying two 100-byte regions in each file, and occasionally creates a
new file of about 1 KB. If it helps, I can probably post the test script
I've been using.

I tried doing a 'git read-tree HEAD' before each 'git add ; git commit'
iteration, and the time for git-commit jumped from about 1 second to about
8 seconds. That is a pretty dramatic slowdown. Any idea why? I wonder
if that's related to the overall commit slowness.

@Carlos and/or @Junio, can you point me at any docs/code to understand
what a tree-cache is and how it differs from the index? I did a google
search for [git tree-cache index], but nothing popped out.

Cheers,
Josh

* Re: Debugging git-commit slowness on a large repo
From: Nguyen Thai Ngoc Duy @ 2011-12-07 2:08 UTC
To: Joshua Redstone
Cc: Carlos Martín Nieto, Tomas Carnecky, Junio C Hamano, git@vger.kernel.org

On Wed, Dec 7, 2011 at 8:48 AM, Joshua Redstone <joshua.redstone@fb.com> wrote:
> I tried doing a 'git read-tree HEAD' before each 'git add ; git commit'
> iteration, and the time for git-commit jumped from about 1 second to about
> 8 seconds. That is a pretty dramatic slowdown. Any idea why? I wonder
> if that's related to the overall commit slowness.

How big is your working directory? "git ls-files | wc -l" should show
it. Try "git read-tree HEAD; git add; git write-tree" and see if the
write-tree part takes as much time as commit. write-tree is mainly
about cache-tree generation.

> @Carlos and/or @Junio, can you point me at any docs/code to understand
> what a tree-cache is and how it differs from the index? I did a google
> search for [git tree-cache index], but nothing popped out.

Have a look at Documentation/technical/index-format.txt. Cache tree
extension is near the end.
--
Duy

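As a rough illustration of why a cache tree matters for write-tree, here is a toy Python model. It is not git's data structure or its on-disk format (see index-format.txt for that); the class name ToyCacheTree and its methods are made up for this sketch. It keeps one cached ID per directory, drops the cached IDs of a changed path's parent directories, and re-hashes only the directories that were dropped. For simplicity it still scans the whole file list when rebuilding a directory, so only the count of re-hashed directories is meaningful.

```python
import hashlib


class ToyCacheTree:
    """Toy model: one cached ID per directory, recomputed only when invalid.

    Hypothetical illustration only; git's real cache-tree is implemented in
    cache-tree.c and stored as an index extension.
    """

    def __init__(self, files):
        # files maps "dir/sub/name" -> file content as bytes
        self.files = dict(files)
        self.cached = {}     # directory path ("" is the root) -> hex digest
        self.recomputed = 0  # number of directories hashed so far

    def invalidate(self, path):
        """Drop the cached ID of every directory that contains `path`."""
        parts = path.split("/")[:-1]
        for depth in range(len(parts) + 1):
            self.cached.pop("/".join(parts[:depth]), None)

    def write(self, path, content):
        """Change one file and invalidate its parent directories."""
        self.files[path] = content
        self.invalidate(path)

    def update(self, directory=""):
        """Return the ID of `directory`, reusing cached IDs where valid."""
        if directory in self.cached:
            return self.cached[directory]
        prefix = directory + "/" if directory else ""
        entries = {}
        for path, content in self.files.items():
            if not path.startswith(prefix):
                continue
            name, sep, _ = path[len(prefix):].partition("/")
            if sep:
                entries[name + "/"] = None  # subdirectory, resolved below
            else:
                entries[name] = hashlib.sha1(content).hexdigest()
        digest = hashlib.sha1()
        for name in sorted(entries):
            ident = entries[name]
            if ident is None:
                ident = self.update(prefix + name[:-1])
            digest.update(f"{name} {ident}\n".encode())
        self.recomputed += 1
        self.cached[directory] = digest.hexdigest()
        return self.cached[directory]
```

In this model, changing one file in a tree with many directories re-hashes only the directories on the path from the root to that file. Without any cached IDs, every directory has to be rebuilt on each commit.
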
* Re: Debugging git-commit slowness on a large repo
From: Joshua Redstone @ 2011-12-07 22:48 UTC
To: Nguyen Thai Ngoc Duy
Cc: Carlos Martín Nieto, Tomas Carnecky, Junio C Hamano, git@vger.kernel.org

Hi Duy,
Thanks for the documentation link.

git ls-files shows 100k files, which matches the number of files in the
working tree ('find . -type f -print | wc -l').

I added a 'git read-tree HEAD' before the git-add, and a 'git write-tree'
after the add. With that, the commit time slowed down to 8 seconds per
commit, plus 4 more seconds for the read-tree/add/write-tree ops. The
read-tree/add/write-tree each took about a second.

As an experiment, I also tried removing the 'git read-tree' and just
having the git-write-tree. That sped up commits to 0.6 seconds, but the
overall time for add/write-tree/commit was still 3 to 6 seconds.

For comparison, without the read-tree and write-tree, commits take about 1
second and add/commit in total takes about 2 seconds.

It surprises me that the presence of git read-tree or write-tree would
slow things down so much.

Josh

* Re: Debugging git-commit slowness on a large repo
From: Nguyen Thai Ngoc Duy @ 2011-12-08 1:39 UTC
To: Joshua Redstone
Cc: Carlos Martín Nieto, Tomas Carnecky, Junio C Hamano, git@vger.kernel.org

On Thu, Dec 8, 2011 at 5:48 AM, Joshua Redstone <joshua.redstone@fb.com> wrote:
> Hi Duy,
> Thanks for the documentation link.
>
> git ls-files shows 100k files, which matches # of files in the working
> tree ('find . -type f -print | wc -l').

Any chance you can split it into smaller repositories, or remove files
from the working directory (e.g. if you store logs, you don't have to
keep logs from all time in the working directory; they can be retrieved
from history)?

> I added a 'git read-tree HEAD' before the git-add, and a 'git write-tree'
> after the add. With that, the commit time slowed down to 8 seconds per
> commit, plus 4 more seconds for the read-tree/add/write-tree ops. The
> read-tree/add/write-tree each took about a second.

read-tree destroys the stat info in the index, and refreshing 100k index
entries in this case may take some time. Try this to see whether the
commit time drops and how much time update-index takes:

  read-tree HEAD
  update-index --refresh
  add ....
  write-tree
  commit -q

> As an experiment, I also tried removing the 'git read-tree' and just
> having the git-write-tree. That sped up commits to 0.6 seconds, but the
> overall time for add/write-tree/commit was still 3 to 6 seconds.

Overall time is not really important because we duplicate work here
(write-tree is done again as part of commit). What I'm trying to do is
to determine how much time each operation in commit may take.
--
Duy

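The point about stat info can be sketched with a toy Python model. This is an illustration under stated assumptions, not git's code: the ToyIndex class and its methods are hypothetical, and it compares only file size and mtime, whereas git's index records more stat fields. The idea is that a stored stat fingerprint lets a refresh skip unchanged files with one lstat() each, while an index without usable stat data has to re-read and re-hash every file.

```python
import hashlib
import os
import tempfile


def stat_key(path):
    """Cheap fingerprint from lstat(): size and mtime, no file read."""
    st = os.lstat(path)
    return (st.st_size, st.st_mtime_ns)


class ToyIndex:
    """Hypothetical stand-in for an index with per-entry stat data."""

    def __init__(self):
        self.entries = {}      # path -> {"sha1": str, "stat": tuple or None}
        self.files_hashed = 0  # how many times file contents were read

    def _hash(self, path):
        self.files_hashed += 1
        with open(path, "rb") as fh:
            return hashlib.sha1(fh.read()).hexdigest()

    def add(self, path):
        self.entries[path] = {"sha1": self._hash(path), "stat": stat_key(path)}

    def drop_stat_info(self):
        """Mimic an index whose entries carry no usable stat data."""
        for entry in self.entries.values():
            entry["stat"] = None

    def refresh(self):
        """Return changed paths; read a file only if its stat data differs."""
        changed = []
        for path, entry in self.entries.items():
            current = stat_key(path)
            if entry["stat"] == current:
                continue  # stat matches: trust the entry without reading
            if self._hash(path) != entry["sha1"]:
                changed.append(path)
            else:
                entry["stat"] = current  # content unchanged, restore stat
        return changed


def demo(n_files=50):
    """Return (files hashed with stat info, files hashed without it)."""
    with tempfile.TemporaryDirectory() as root:
        index = ToyIndex()
        for i in range(n_files):
            path = os.path.join(root, f"f{i}.txt")
            with open(path, "w") as fh:
                fh.write(f"file {i}\n")
            index.add(path)
        index.files_hashed = 0
        index.refresh()
        warm = index.files_hashed
        index.drop_stat_info()
        index.refresh()
        cold = index.files_hashed - warm
        return warm, cold
```

In this model the first refresh reads no files, and the refresh after drop_stat_info() reads all of them. That pattern is consistent with the extra seconds reported after 'git read-tree HEAD', though the real cost in git depends on details this sketch ignores.
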
* Re: Debugging git-commit slowness on a large repo
From: Joshua Redstone @ 2011-12-09 0:09 UTC
To: Nguyen Thai Ngoc Duy
Cc: Carlos Martín Nieto, Tomas Carnecky, Junio C Hamano, git@vger.kernel.org

On 12/7/11 5:39 PM, "Nguyen Thai Ngoc Duy" <pclouds@gmail.com> wrote:

> Any chance you can split it into smaller repositories, or remove files
> from working directory (e.g. if you store logs, you don't have to keep
> logs from all time in working directory, they can be retrieved from
> history).

It's not really feasible to split it into smaller repositories. In fact,
we're expecting it to grow between 3x and 5x in number of files and number
of commits.

> read-tree destroys stat info in index, refreshing 100k entries in
> index in this case may take some time. Try this to see if commit time
> reduces and how much time update-index takes
>
> read-tree HEAD
> update-index --refresh
> add ....
> write-tree
> commit -q

I added the "update-index --refresh" and the time for commit became more
like 0.6 seconds.
In this setup: read-tree takes ~2 seconds, update-index takes ~8 seconds,
git-add takes 1 to 4 seconds, and write-tree takes less than 1 second.

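Per-step timings like the ones above can be collected with a small harness. The following Python sketch is one possible way to do it and is not the script used in the thread. The function name time_steps, the GIT_STEPS list, the file path "path/to/changed-file" and the commit message are placeholders for illustration. Return codes are recorded rather than treated as fatal, because 'git update-index --refresh' can exit non-zero when some files need updating.

```python
import subprocess
import time


def time_steps(steps, cwd=None):
    """Run each (label, argv) step in order.

    Returns a list of (label, seconds, returncode) tuples.
    """
    results = []
    for label, argv in steps:
        start = time.perf_counter()
        done = subprocess.run(
            argv,
            cwd=cwd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        results.append((label, time.perf_counter() - start, done.returncode))
    return results


# The command sequence suggested in the thread. The path passed to
# "git add" and the commit message are placeholders, not from the thread.
GIT_STEPS = [
    ("read-tree", ["git", "read-tree", "HEAD"]),
    ("update-index", ["git", "update-index", "--refresh"]),
    ("add", ["git", "add", "path/to/changed-file"]),
    ("write-tree", ["git", "write-tree"]),
    ("commit", ["git", "commit", "-q", "-m", "timing test"]),
]

# Example use inside a scratch clone (it modifies the index and history):
#
#   for label, seconds, rc in time_steps(GIT_STEPS, cwd="/path/to/repo"):
#       print(f"{label:14s} {seconds:8.3f}s  exit={rc}")
```

Timing each command as a separate process includes process start-up and index load for every step, so the sum will be larger than a single 'git commit'. As noted in the thread, the per-step numbers are the interesting part.
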
* Re: Debugging git-commit slowness on a large repo
From: Joshua Redstone @ 2011-12-09 0:17 UTC
To: Nguyen Thai Ngoc Duy
Cc: Carlos Martín Nieto, Tomas Carnecky, Junio C Hamano, git@vger.kernel.org

Btw, I also tried doing some very poor-man's profiling on "git commit"
without any of the readtree/writetree/updateindex commands.

Around 50% of the time was in (bottom few frames may have varied)

#1  0x00000000004c467e in find_pack_entry (sha1=0x1475a44 , e=0x7fff2621f070) at sha1_file.c:2027
#2  0x00000000004c57b0 in has_sha1_file (sha1=0x7fe2cd9c7900 "00") at sha1_file.c:2567
#3  0x000000000046e4af in update_one (it=<value optimized out>, cache=<value optimized out>, entries=<value optimized out>, base=<value optimized out>, baselen=<value optimized out>, missing_ok=<value optimized out>, dryrun=0) at cache-tree.c:333
#4  0x000000000046e278 in update_one (it=<value optimized out>, cache=<value optimized out>, entries=<value optimized out>, base=<value optimized out>, baselen=<value optimized out>, missing_ok=<value optimized out>, dryrun=0) at cache-tree.c:285
#5  0x000000000046e278 in update_one (it=<value optimized out>, cache=<value optimized out>, entries=<value optimized out>, base=<value optimized out>, baselen=<value optimized out>, missing_ok=<value optimized out>, dryrun=0) at cache-tree.c:285
#6  0x000000000046e278 in update_one (it=<value optimized out>, cache=<value optimized out>, entries=<value optimized out>, base=<value optimized out>, baselen=<value optimized out>, missing_ok=<value optimized out>, dryrun=0) at cache-tree.c:285
#7  0x000000000046e278 in update_one (it=<value optimized out>, cache=<value optimized out>, entries=<value optimized out>, base=<value optimized out>, baselen=<value optimized out>, missing_ok=<value optimized out>, dryrun=0) at cache-tree.c:285
#8  0x000000000046e278 in update_one (it=<value optimized out>, cache=<value optimized out>, entries=<value optimized out>, base=<value optimized out>, baselen=<value optimized out>, missing_ok=<value optimized out>, dryrun=0) at cache-tree.c:285
#9  0x000000000046e869 in cache_tree_update (it=<value optimized out>, cache=<value optimized out>, entries=dwarf2_read_address: Corrupted DWARF expression.
) at cache-tree.c:379
#10 0x000000000041cade in prepare_to_commit (index_file=0x781740 ".git/index", prefix=<value optimized out>, current_head=<value optimized out>, s=0x7fff26220d00, author_ident=<value optimized out>) at builtin/commit.c:866
#11 0x000000000041d891 in cmd_commit (argc=0, argv=0x7fff262213a0, prefix=0x0) at builtin/commit.c:1407
#12 0x0000000000404bf7 in handle_internal_command (argc=4, argv=0x7fff262213a0) at git.c:308
#13 0x0000000000404e2f in main (argc=4, argv=0x7fff262213a0) at git.c:512

And 30% of the time was in:

#0  0x00000034af2c34a5 in _lxstat () from /lib64/libc.so.6
#1  0x00000000004abe0f in refresh_cache_ent (istate=0x780940, ce=0x7f8462a34e40, options=0, err=0x7fff6dd9f588) at /usr/include/sys/stat.h:443
#2  0x00000000004ac1a0 in refresh_index (istate=0x780940, flags=<value optimized out>, pathspec=<value optimized out>, seen=<value optimized out>, header_msg=0x0) at read-cache.c:1133
#3  0x000000000041b60a in refresh_cache_or_die (refresh_flags=<value optimized out>) at builtin/commit.c:331
#4  0x000000000041bc39 in prepare_index (argc=0, argv=0x7fff6dda0310, prefix=0x0, current_head=<value optimized out>, is_status=<value optimized out>) at builtin/commit.c:414
#5  0x000000000041d878 in cmd_commit (argc=0, argv=0x7fff6dda0310, prefix=0x0) at builtin/commit.c:1403

Josh

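The "poor-man's profiling" above amounts to interrupting the process a number of times and counting how often a given function shows up in the backtrace. A minimal Python sketch of the counting step follows. The function name share_of_samples is hypothetical, and the sample stacks in the usage note are constructed purely to mirror the roughly 50% and 30% shares reported in the message; they are not measured data.

```python
from collections import Counter


def share_of_samples(samples, functions):
    """Fraction of stack samples in which each named function appears.

    `samples` is a list of stacks; each stack is a list of function names
    as they would appear in a debugger backtrace.
    """
    total = len(samples)
    if total == 0:
        return {}
    hits = Counter()
    for stack in samples:
        present = set(stack)
        for name in functions:
            if name in present:
                hits[name] += 1
    return {name: hits[name] / total for name in functions}
```

For example, with ten illustrative samples, five passing through cache_tree_update, three through refresh_index and two elsewhere, share_of_samples(samples, ["cache_tree_update", "refresh_index"]) gives 0.5 and 0.3. With so few samples the shares are only a rough guide, which matches the "very poor-man's" caveat in the message.
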
* Re: Debugging git-commit slowness on a large repo
From: Joshua Redstone @ 2011-12-13 0:15 UTC
To: Nguyen Thai Ngoc Duy, Carlos Martín Nieto, Tomas Carnecky, Junio C Hamano
Cc: git@vger.kernel.org

Sorry for the poor formatting of the stack trace.

I've written two scripts to reproduce the slow commit behavior that I see.
I've posted both to:

  https://gist.github.com/1469760

To repro, first create a dir with lots of files (it defaults to creating 1
million files in 1000 dirs):

  $ loadGen.py --baseDir=./bigdir

then, run the simulator scripts to generate and commit a series of small
changes to the repo:

  $ git reset --hard HEAD && simulate.py ./bigdir git

The git reset is to clean up any cruft left over from a previous partial
invocation of simulate.py.

Note that loadGen.py defaults to creating 1 million files and committing
them in one commit. With a flash drive this took < 30 min, and subsequent
small commits in simulate.py took about 6 seconds. With a hard-drive,
it's taking > 1hr (still waiting for it to finish).

Cheers,
Josh

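For readers who only want the shape of the load-generation step, here is a scaled-down Python sketch. It is not the loadGen.py from the gist. The function name make_tree, its parameters and the file naming scheme are assumptions made for illustration, and it only creates files; it does not run git.

```python
import os


def make_tree(base_dir, n_dirs=10, files_per_dir=10, size=100):
    """Create n_dirs * files_per_dir small text files under base_dir.

    Returns the number of files written. The defaults are deliberately
    tiny; the thread's test used about 1 million files in 1000 dirs.
    """
    count = 0
    for d in range(n_dirs):
        dir_path = os.path.join(base_dir, f"dir{d:04d}")
        os.makedirs(dir_path, exist_ok=True)
        for f in range(files_per_dir):
            file_path = os.path.join(dir_path, f"file{f:04d}.txt")
            with open(file_path, "w") as fh:
                fh.write("x" * size + "\n")
            count += 1
    return count
```

Calling make_tree("./bigdir", n_dirs=1000, files_per_dir=1000) would approximate the file count mentioned above. Expect that to take a long time and a lot of inodes, as the flash versus hard-drive timings in the message suggest.
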
* Re: Debugging git-commit slowness on a large repo 2011-12-13 0:15 ` Joshua Redstone @ 2011-12-20 0:51 ` Joshua Redstone 2011-12-20 1:21 ` Junio C Hamano 0 siblings, 1 reply; 16+ messages in thread From: Joshua Redstone @ 2011-12-20 0:51 UTC (permalink / raw) To: Nguyen Thai Ngoc Duy, Carlos Martín Nieto, Tomas Carnecky, Junio C Hamano Cc: git@vger.kernel.org I've managed to speed up git-commit on large repos by 4x by removing some safeguards that caused git to stat every file in the repo on commits that touch a small number of files. The diff, for illustrative purposes only, is at: https://gist.github.com/1499621 With a repo with 1 million files (but few commits), the diff drops the commit time down from 7.3 seconds to 1.8 seconds, a 75% decrease. The optimizations are: 1. Remove call to refresh_cache_or_die that stats every file in the repo, i think the purpose is to detect any changes between git-add and git-commit. 2. Pass missing_ok=true to cache_tree_update. This causes the tree generation code to not stat every file in the repo to verify it still exists as a git object. 3. Remove pair discard_cache/read_cache_from, which rereads the index file. I think this was in case a pre-commit hook changed the set of things being committed. It may be worth making some of these flag-enabled. Josh On 12/12/11 4:15 PM, "Joshua Redstone" <joshua.redstone@fb.com> wrote: >Sorry for the poor formatting of the stack trace. > >I've written two scripts to reproduce the slow commit behavior that I see. 
>I've posted both to:
>  https://gist.github.com/1469760
>
>To repro, first create a dir with lots of files (it defaults to creating 1
>million files in 1000 dirs):
>
>$ loadGen.py --baseDir=./bigdir
>
>then, run the simulator scripts to generate and commit a series of small
>changes to the repo:
>
>$ git reset --hard HEAD && simulate.py ./bigdir git
>
>The git reset is to clean up any cruft left over from a previous partial
>invocation of simulate.py
>
>Note that loadGen.py defaults to creating 1 million files and committing
>them in one commit. With a flash drive this took < 30 min, and subsequent
>small commits in simulate.py took about 6 seconds. With a hard-drive,
>it's taking > 1hr (still waiting for it to finish).
>
>Cheers,
>Josh
>
>
>On 12/8/11 4:17 PM, "Joshua Redstone" <joshua.redstone@fb.com> wrote:
>
>>Btw, I also tried doing some very poor-man's profiling on "git commit"
>>without any of the readtree/writetree/updateindex commands.
>>
>>Around 50% of the time was in (bottom few frames may have varied)
>>
>>#1  0x00000000004c467e in find_pack_entry (sha1=0x1475a44 , e=0x7fff2621f070) at sha1_file.c:2027
>>#2  0x00000000004c57b0 in has_sha1_file (sha1=0x7fe2cd9c7900 "00") at sha1_file.c:2567
>>#3  0x000000000046e4af in update_one (it=<value optimized out>, cache=<value optimized out>, entries=<value optimized out>, base=<value optimized out>, baselen=<value optimized out>, missing_ok=<value optimized out>, dryrun=0) at cache-tree.c:333
>>#4  0x000000000046e278 in update_one (it=<value optimized out>, cache=<value optimized out>, entries=<value optimized out>, base=<value optimized out>, baselen=<value optimized out>, missing_ok=<value optimized out>, dryrun=0) at cache-tree.c:285
>>#5  0x000000000046e278 in update_one (it=<value optimized out>, cache=<value optimized out>, entries=<value optimized out>, base=<value optimized out>, baselen=<value optimized out>, missing_ok=<value optimized out>, dryrun=0) at cache-tree.c:285
>>#6  0x000000000046e278 in update_one (it=<value optimized out>, cache=<value optimized out>, entries=<value optimized out>, base=<value optimized out>, baselen=<value optimized out>, missing_ok=<value optimized out>, dryrun=0) at cache-tree.c:285
>>#7  0x000000000046e278 in update_one (it=<value optimized out>, cache=<value optimized out>, entries=<value optimized out>, base=<value optimized out>, baselen=<value optimized out>, missing_ok=<value optimized out>, dryrun=0) at cache-tree.c:285
>>#8  0x000000000046e278 in update_one (it=<value optimized out>, cache=<value optimized out>, entries=<value optimized out>, base=<value optimized out>, baselen=<value optimized out>, missing_ok=<value optimized out>, dryrun=0) at cache-tree.c:285
>>#9  0x000000000046e869 in cache_tree_update (it=<value optimized out>, cache=<value optimized out>, entries=dwarf2_read_address: Corrupted DWARF expression.
>>) at cache-tree.c:379
>>#10 0x000000000041cade in prepare_to_commit (index_file=0x781740 ".git/index", prefix=<value optimized out>, current_head=<value optimized out>, s=0x7fff26220d00, author_ident=<value optimized out>) at builtin/commit.c:866
>>#11 0x000000000041d891 in cmd_commit (argc=0, argv=0x7fff262213a0, prefix=0x0) at builtin/commit.c:1407
>>#12 0x0000000000404bf7 in handle_internal_command (argc=4, argv=0x7fff262213a0) at git.c:308
>>#13 0x0000000000404e2f in main (argc=4, argv=0x7fff262213a0) at git.c:512
>>
>>And 30% of the time was in:
>>
>>#0  0x00000034af2c34a5 in _lxstat () from /lib64/libc.so.6
>>#1  0x00000000004abe0f in refresh_cache_ent (istate=0x780940, ce=0x7f8462a34e40, options=0, err=0x7fff6dd9f588) at /usr/include/sys/stat.h:443
>>#2  0x00000000004ac1a0 in refresh_index (istate=0x780940, flags=<value optimized out>, pathspec=<value optimized out>, seen=<value optimized out>, header_msg=0x0) at read-cache.c:1133
>>#3  0x000000000041b60a in refresh_cache_or_die (refresh_flags=<value optimized out>) at builtin/commit.c:331
>>#4  0x000000000041bc39 in prepare_index (argc=0, argv=0x7fff6dda0310, prefix=0x0, current_head=<value optimized out>, is_status=<value optimized out>) at builtin/commit.c:414
>>#5  0x000000000041d878 in cmd_commit (argc=0, argv=0x7fff6dda0310, prefix=0x0) at builtin/commit.c:1403
>>
>>Josh
>>
>>
>>On 12/8/11 4:09 PM, "Joshua Redstone" <joshua.redstone@fb.com> wrote:
>>
>>>On 12/7/11 5:39 PM, "Nguyen Thai Ngoc Duy" <pclouds@gmail.com> wrote:
>>>
>>>>On Thu, Dec 8, 2011 at 5:48 AM, Joshua Redstone <joshua.redstone@fb.com> wrote:
>>>>> Hi Duy,
>>>>> Thanks for the documentation link.
>>>>>
>>>>> git ls-files shows 100k files, which matches # of files in the working
>>>>> tree ('find . -type f -print | wc -l').
>>>>
>>>>Any chance you can split it into smaller repositories, or remove files
>>>>from working directory (e.g. if you store logs, you don't have to keep
>>>>logs from all time in working directory, they can be retrieved from
>>>>history).
>>>
>>>It's not really feasible to split it into smaller repositories. In fact,
>>>we're expecting it to grow between 3x and 5x in number of files and number
>>>of commits.
>>>
>>>>
>>>>> I added a 'git read-tree HEAD' before the git-add, and a 'git write-tree'
>>>>> after the add. With that, the commit time slowed down to 8 seconds per
>>>>> commit, plus 4 more seconds for the read-tree/add/write-tree ops. The
>>>>> read-tree/add/write-tree each took about a second.
>>>>
>>>>read-tree destroys stat info in index, refreshing 100k entries in
>>>>index in this case may take some time. Try this to see if commit time
>>>>reduces and how much time update-index takes
>>>>
>>>>read-tree HEAD
>>>>update-index --refresh
>>>>add ....
>>>>write-tree
>>>>commit -q
>>>
>>>I added the "update-index --refresh" and the time for commit became more
>>>like 0.6 seconds.
>>>In this setup: read-tree takes ~2 seconds, update-index takes ~8 seconds,
>>>git-add takes 1 to 4 seconds, and write-tree takes less than 1 second.
>>>
>>>>
>>>>> As an experiment, I also tried removing the 'git read-tree' and just
>>>>> having the git-write-tree. That sped up commits to 0.6 seconds, but the
>>>>> overall time for add/write-tree/commit was still 3 to 6 seconds.
>>>>
>>>>overall time is not really important because we duplicate work here
>>>>(write-tree is done as part of commit again). What I'm trying to do is
>>>>to determine how much time each operation in commit may take.
>>>>--
>>>>Duy
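The per-step timing that Duy asks for in the quoted exchange can be scripted. What follows is a minimal sketch, not the simulate.py from the gist: `time_steps`, `STEPS` and the injectable `run` parameter are hypothetical names introduced here, and the path given to `git add` is a placeholder because the original message elides it. Exit codes are deliberately not checked, since `git update-index --refresh` can exit non-zero when it finds entries that need updating.

```python
import subprocess
import time

# Step sequence from the quoted exchange. "path/to/changed-file" is a
# placeholder; the original message does not say what was added.
STEPS = [
    ["git", "read-tree", "HEAD"],
    ["git", "update-index", "--refresh"],
    ["git", "add", "path/to/changed-file"],
    ["git", "write-tree"],
    ["git", "commit", "-q", "-m", "timing test"],
]


def time_steps(steps, run=subprocess.run):
    """Run each command in order and return (command string, seconds) pairs.

    `run` is injectable so the harness can be exercised without a repo.
    """
    timings = []
    for cmd in steps:
        start = time.perf_counter()
        run(cmd, check=False)  # exit status is ignored on purpose
        timings.append((" ".join(cmd), time.perf_counter() - start))
    return timings
```

Calling `time_steps(STEPS)` from inside a work tree and printing the pairs gives one wall-clock figure per step, which is the breakdown the quoted numbers (read-tree, update-index, add, write-tree, commit) report by hand.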
* Re: Debugging git-commit slowness on a large repo
From: Junio C Hamano @ 2011-12-20 1:21 UTC
To: Joshua Redstone
Cc: Nguyen Thai Ngoc Duy, Carlos Martín Nieto, Tomas Carnecky, git@vger.kernel.org

Joshua Redstone <joshua.redstone@fb.com> writes:

> I've managed to speed up git-commit on large repos by 4x by removing some
> safeguards that caused git to stat every file in the repo on commits that
> touch a small number of files. The diff, for illustrative purposes only,
> is at:
>
> https://gist.github.com/1499621
>
>
> With a repo with 1 million files (but few commits), the diff drops the
> commit time down from 7.3 seconds to 1.8 seconds, a 75% decrease. The
> optimizations are:

I do not know if these kind of changes are called "optimizations" or
merely making the command record a random tree object that may have some
resemblance to what you wanted to commit but is subtly incorrect. I didn't
fetch your safety removal, though.

Wouldn't you get a similar speed-up without being unsafe if you simply ran
"git commit" without any parameter (i.e. write out the current index as a
tree and make a commit), combined with "--no-status" and perhaps "-q" to
avoid running the comparison between the resulting commit and the working
tree state after the commit?
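The invocation Junio describes can be written down as a small helper. This is only a sketch: `build_commit_cmd` is a hypothetical name introduced here, and the `-m` argument is an addition of this example (so the command does not open an editor), not something the message asks for. The `--no-status` and `-q` flags are the ones named in the message; whether they help on a given repository is something to measure, not something this snippet shows.

```python
def build_commit_cmd(message, no_status=True, quiet=True):
    """Return the argv for committing the current index as-is.

    No pathspec and no -a are passed, so git writes out the index as a
    tree and commits that tree.
    """
    cmd = ["git", "commit"]
    if no_status:
        cmd.append("--no-status")
    if quiet:
        cmd.append("-q")
    cmd += ["-m", message]
    return cmd
```

The returned list can be handed to `subprocess.run(..., check=True)` after the files of interest have been staged with `git add`.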
* Re: Debugging git-commit slowness on a large repo
From: Joshua Redstone @ 2011-12-20 1:40 UTC
To: Junio C Hamano
Cc: Nguyen Thai Ngoc Duy, Carlos Martín Nieto, Tomas Carnecky, git@vger.kernel.org

You're right, more than optimizations, they are modifications that reduce
safety checks and make assumptions about the way one is using git (e.g.,
you always remember to add each file you want to commit). I focused on
them because:

1. In our installation, we don't use commit hooks that change what's being
   committed, so it's good to know that in principle, there's a big perf
   benefit to be had by leveraging that fact.
2. At an abstract level, it seems like the cost of doing a commit should
   be proportional to the amount of the repository touched by the commit,
   not by the size of the repository. These experiments are demonstrations
   of one direction that a set of optimizations would need to go to get
   commit performance more along those lines.
3. We're also exploring storage systems that support more efficient ways
   to query what's changed than stat'ing every file.

I forgot to mention, the times I quoted were with --no-verify and
--no-status. Adding '-q' didn't speed up performance at all.

As a bonus, I've also profiled git-add on the 1-million file repo, and it
looks like, as you might expect, the time is dominated by reading and
writing the index. The time for git-add is a couple of seconds.

Josh

On 12/19/11 5:21 PM, "Junio C Hamano" <gitster@pobox.com> wrote:

>Joshua Redstone <joshua.redstone@fb.com> writes:
>
>> I've managed to speed up git-commit on large repos by 4x by removing some
>> safeguards that caused git to stat every file in the repo on commits that
>> touch a small number of files. The diff, for illustrative purposes only,
>> is at:
>>
>> https://gist.github.com/1499621
>>
>>
>> With a repo with 1 million files (but few commits), the diff drops the
>> commit time down from 7.3 seconds to 1.8 seconds, a 75% decrease. The
>> optimizations are:
>
>I do not know if these kind of changes are called "optimizations" or
>merely making the command record a random tree object that may have some
>resemblance to what you wanted to commit but is subtly incorrect. I didn't
>fetch your safety removal, though.
>
>Wouldn't you get a similar speed-up without being unsafe if you simply ran
>"git commit" without any parameter (i.e. write out the current index as a
>tree and make a commit), combined with "--no-status" and perhaps "-q" to
>avoid running the comparison between the resulting commit and the working
>tree state after the commit?
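The point about cost being proportional to the change and not to the repository can be illustrated with a toy model. The sketch below is not git's code: `snapshot`, `changed` and `demo` are hypothetical names, and the cached `(mtime, size)` pairs only stand in for the stat data an index keeps per entry. It counts lstat calls for a full refresh versus a refresh limited to the paths known to have been touched.

```python
import os
import tempfile


def snapshot(paths):
    """Record (mtime_ns, size) per path, standing in for cached stat data."""
    cached = {}
    for p in paths:
        st = os.lstat(p)
        cached[p] = (st.st_mtime_ns, st.st_size)
    return cached


def changed(paths, cached):
    """lstat each given path; return (changed paths, number of lstat calls)."""
    dirty, calls = [], 0
    for p in paths:
        st = os.lstat(p)
        calls += 1
        if (st.st_mtime_ns, st.st_size) != cached[p]:
            dirty.append(p)
    return dirty, calls


def demo(n_files=1000):
    """Build n_files files, modify one, and compare the two refresh strategies."""
    with tempfile.TemporaryDirectory() as root:
        paths = []
        for i in range(n_files):
            p = os.path.join(root, f"f{i}.txt")
            with open(p, "w") as fh:
                fh.write("x")
            paths.append(p)
        cached = snapshot(paths)

        # Modify one file; the size change guarantees the stat data differs.
        with open(paths[0], "a") as fh:
            fh.write("more")

        full = changed(paths, cached)        # cost grows with repo size
        hinted = changed(paths[:1], cached)  # cost grows with change size
        return full, hinted
```

Both strategies find the same modified file, but the full scan issues one lstat per tracked file while the hinted scan issues one per touched file. The hinted variant is only correct if something else reliably supplies the list of touched paths, which is the assumption the safety checks in git-commit decline to make.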
* Re: Debugging git-commit slowness on a large repo
From: Thomas Rast @ 2011-12-20 9:23 UTC
To: Joshua Redstone
Cc: Junio C Hamano, Nguyen Thai Ngoc Duy, Carlos Martín Nieto, Tomas Carnecky, git@vger.kernel.org

Joshua Redstone <joshua.redstone@fb.com> writes:

> As a bonus, I've also profiled git-add on the 1-million file repo, and it
> looks like, as you might expect, the time is dominated by reading and
> writing the index. The time for git-add is a couple of seconds.

Note that the time to write the index itself is also rather small, but
the time needed to sha1 the index when loading and then again when
saving it really hurts.

(I noticed this while working on the commit-tree topic.)

--
Thomas Rast
trast@{inf,student}.ethz.ch
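The hashing cost Thomas mentions comes from the index file ending in a SHA-1 checksum of everything that precedes it, so a reader that verifies the file and a writer that produces it each hash the whole content. A minimal sketch of that trailer scheme follows; `write_with_trailer` and `read_with_trailer` are hypothetical names, and the payload is arbitrary bytes, not the real index entry layout.

```python
import hashlib

SHA1_LEN = 20  # size of a raw SHA-1 digest in bytes


def write_with_trailer(payload: bytes) -> bytes:
    """Return payload followed by the SHA-1 of payload (hashes every byte)."""
    return payload + hashlib.sha1(payload).digest()


def read_with_trailer(blob: bytes) -> bytes:
    """Verify the trailing SHA-1 and return the payload (hashes every byte again)."""
    payload, trailer = blob[:-SHA1_LEN], blob[-SHA1_LEN:]
    if hashlib.sha1(payload).digest() != trailer:
        raise ValueError("checksum mismatch")
    return payload
```

Because both functions hash the full payload, the work per load or save grows with the size of the file, independent of how many entries actually changed.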
* Re: Debugging git-commit slowness on a large repo
From: Joshua Redstone @ 2011-12-20 19:26 UTC
To: Thomas Rast
Cc: Junio C Hamano, Nguyen Thai Ngoc Duy, Carlos Martín Nieto, Tomas Carnecky, git@vger.kernel.org

I looked again at my poor-mans-profiling output of git-add. The Sha1
stuff under ce_write_entry->ce_write_flush takes a bunch of time.
commit_lock_file->rename takes about the same as well.

Btw, the perf numbers for commit and add are with a warm file cache. I
expect the benefit of skipping all the stat() calls will increase for
cold cache.

Josh

On 12/20/11 1:23 AM, "Thomas Rast" <trast@student.ethz.ch> wrote:

>Joshua Redstone <joshua.redstone@fb.com> writes:
>> As a bonus, I've also profiled git-add on the 1-million file repo, and it
>> looks like, as you might expect, the time is dominated by reading and
>> writing the index. The time for git-add is a couple of seconds.
>
>Note that the time to write the index itself is also rather small, but
>the time needed to sha1 the index when loading and then again when
>saving it really hurts.
>
>(I noticed this while working on the commit-tree topic.)
>
>--
>Thomas Rast
>trast@{inf,student}.ethz.ch
* Re: Debugging git-commit slowness on a large repo
From: Tomas Carnecky @ 2011-12-04 13:54 UTC
To: Joshua Redstone
Cc: git@vger.kernel.org

On 12/3/11 12:17 AM, Joshua Redstone wrote:
> Hi,
> I have a git repo with about 300k commits, 150k files totaling maybe 7GB.
> Locally committing a small change - say touching fewer than 300 bytes
> across 4 files - consistently takes over one second, which seems kinda
> slow. This is using git 1.7.7.4 on a linux 2.6 box. The time does not
> improve after doing a git-gc (my .git dir has maybe 250 files after a git
> gc). The same size commit on a brand new repo takes < 10ms. Any thoughts
> on why committing a small change seems to take a long time on larger repos?
>
> Fwiw, I also tried doing the same test using libgit2 (via the pygit2
> wrapper), and it was ever slower (about 6 seconds to commit the same small
> change).

try git commit --no-status