* git status: small difference between stating whole repository and small subdirectory @ 2012-02-10 9:42 Piotr Krukowiecki 2012-02-10 12:33 ` Nguyen Thai Ngoc Duy 2012-02-10 16:18 ` Piotr Krukowiecki 0 siblings, 2 replies; 52+ messages in thread From: Piotr Krukowiecki @ 2012-02-10 9:42 UTC (permalink / raw) To: Git Mailing List Hi, I compared stating whole tree vs one small subdirectory, and I expected that for the subdirectory status will be very very fast. After all, it has only few files to stat. But it's not fast. Why? With cold cache (echo 3 | sudo tee /proc/sys/vm/drop_caches): $ time git status > /dev/null real 0m41.670s user 0m0.980s sys 0m2.908s $ time git status -- src/.../somedir > /dev/null real 0m17.380s user 0m0.748s sys 0m0.328s With warm cache: $ time git status > /dev/null real 0m0.792s user 0m0.404s sys 0m0.384s $ time git status -- src/.../somedir > /dev/null real 0m0.335s user 0m0.288s sys 0m0.048s Repository/dir stats: $ find * -type f | wc -l 127189 $ du -shc * | grep total 2.2G total $ find src/.../somedir -type f | wc -l 55 $ du -shc src/.../somedir | grep total 224K total $ git --version git version 1.7.9.rc0.10.gbeecc -- Piotr Krukowiecki ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-02-10 9:42 git status: small difference between stating whole repository and small subdirectory Piotr Krukowiecki @ 2012-02-10 12:33 ` Nguyen Thai Ngoc Duy 2012-02-10 13:46 ` Piotr Krukowiecki 2012-02-10 16:18 ` Piotr Krukowiecki 1 sibling, 1 reply; 52+ messages in thread From: Nguyen Thai Ngoc Duy @ 2012-02-10 12:33 UTC (permalink / raw) To: Piotr Krukowiecki; +Cc: Git Mailing List On Fri, Feb 10, 2012 at 4:42 PM, Piotr Krukowiecki <piotr.krukowiecki@gmail.com> wrote: > Hi, > > I compared stating whole tree vs one small subdirectory, and I > expected that for the subdirectory status will be very very fast. > After all, it has only few files to stat. But it's not fast. Why? Because stat'ing is not the only thing git-status does? In order to find out staged changes, unstaged changes and untracked files, it has to do the equivalence of "git diff --cached", "git diff" and "git ls-files -o". I think copy detection is also enabled, which uses more cycles. Profiling it should give you a good idea what parts cost most. -- Duy ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-02-10 12:33 ` Nguyen Thai Ngoc Duy @ 2012-02-10 13:46 ` Piotr Krukowiecki 2012-02-10 14:37 ` Nguyen Thai Ngoc Duy 0 siblings, 1 reply; 52+ messages in thread From: Piotr Krukowiecki @ 2012-02-10 13:46 UTC (permalink / raw) To: Git Mailing List; +Cc: Nguyen Thai Ngoc Duy On Fri, Feb 10, 2012 at 1:33 PM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote: > On Fri, Feb 10, 2012 at 4:42 PM, Piotr Krukowiecki > <piotr.krukowiecki@gmail.com> wrote: >> Hi, >> >> I compared stating whole tree vs one small subdirectory, and I >> expected that for the subdirectory status will be very very fast. >> After all, it has only few files to stat. But it's not fast. Why? > > Because stat'ing is not the only thing git-status does? In order to > find out staged changes, unstaged changes and untracked files, it has > to do the equivalence of "git diff --cached", "git diff" and "git > ls-files -o". I think copy detection is also enabled, which uses more > cycles. I believe copy detection is not done, neither for tracked nor untracked files. Rename detection is done for tracked files. In this case it should not matter, as there were no changes to the files. My point is, that for such small number of small files (55 files and 223KB), which had no changes at all, the status took a lot of time (17 seconds) and doing status on whole repository which has more than 2000x files and 10000x data took only 2x more time. I don't think that the algorithm scales so well, so my guess is that 'status' is so inefficient for subdirectories (i.e. the "git status -- dir" filter does not filter as much as it could). > Profiling it should give you a good idea what parts cost most. I'll try that later. BTW by stating I meant using "status", not that it uses stat() or whatever. -- Piotr Krukowiecki ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-02-10 13:46 ` Piotr Krukowiecki @ 2012-02-10 14:37 ` Nguyen Thai Ngoc Duy 2012-02-13 16:54 ` Piotr Krukowiecki 0 siblings, 1 reply; 52+ messages in thread From: Nguyen Thai Ngoc Duy @ 2012-02-10 14:37 UTC (permalink / raw) To: Piotr Krukowiecki; +Cc: Git Mailing List On Fri, Feb 10, 2012 at 8:46 PM, Piotr Krukowiecki <piotr.krukowiecki@gmail.com> wrote: > On Fri, Feb 10, 2012 at 1:33 PM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote: >> On Fri, Feb 10, 2012 at 4:42 PM, Piotr Krukowiecki >> <piotr.krukowiecki@gmail.com> wrote: >>> Hi, >>> >>> I compared stating whole tree vs one small subdirectory, and I >>> expected that for the subdirectory status will be very very fast. >>> After all, it has only few files to stat. But it's not fast. Why? >> >> Because stat'ing is not the only thing git-status does? In order to >> find out staged changes, unstaged changes and untracked files, it has >> to do the equivalence of "git diff --cached", "git diff" and "git >> ls-files -o". I think copy detection is also enabled, which uses more >> cycles. > > I believe copy detection is not done, neither for tracked nor untracked files. > Rename detection is done for tracked files. In this case it should not > matter, as there were no changes to the files. > > My point is, that for such small number of small files (55 files and > 223KB), which had no changes at all, the status took a lot of time (17 > seconds) and doing status on whole repository which has more than > 2000x files and 10000x data took only 2x more time. > > I don't think that the algorithm scales so well, so my guess is that > 'status' is so inefficient for subdirectories (i.e. the "git status -- > dir" filter does not filter as much as it could). I think the cost is in $GIT_DIR, not the working directory. -- Duy ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-02-10 14:37 ` Nguyen Thai Ngoc Duy @ 2012-02-13 16:54 ` Piotr Krukowiecki 0 siblings, 0 replies; 52+ messages in thread From: Piotr Krukowiecki @ 2012-02-13 16:54 UTC (permalink / raw) To: Nguyen Thai Ngoc Duy; +Cc: Git Mailing List On Fri, Feb 10, 2012 at 3:37 PM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote: > I think the cost is in $GIT_DIR, not the working directory. Could you explain? I got the problem again today. This time I've made a copy of the repository so hopefully I'll able to reproduce the problems. This time it's a different repository but the 'status' on a small subdirectory is even more than 2x slower than on the whole repository. Whole repo: $ find * -type f | wc -l 33021 $ du -shc * | grep total 2.1G total The subdir: $ find * -type f | wc -l 17 $ du -shc * | grep total 84K total As previously, timing was done with cold cache (echo 3 | sudo tee /proc/sys/vm/drop_caches) and executed several times. This time I have used recent git (1.7.9.188.g12766) compiled with -pg. git was executed in the subdirectory. Tracked files were not changed/deleted, there was just a couple of small untracked files. Timings: $ time git status real 0m16.595s user 0m0.680s sys 0m0.616s $ time git status -- . real 0m10.030s user 0m0.464s sys 0m0.184s You can find gprof output here: http://pastebin.com/mhddDUmv - from whole repo status http://pastebin.com/1LdVn77A - from subdir status -- Piotr Krukowiecki ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-02-10 9:42 git status: small difference between stating whole repository and small subdirectory Piotr Krukowiecki 2012-02-10 12:33 ` Nguyen Thai Ngoc Duy @ 2012-02-10 16:18 ` Piotr Krukowiecki 2012-02-14 11:34 ` Thomas Rast 1 sibling, 1 reply; 52+ messages in thread From: Piotr Krukowiecki @ 2012-02-10 16:18 UTC (permalink / raw) To: Git Mailing List; +Cc: Nguyen Thai Ngoc Duy On Fri, Feb 10, 2012 at 10:42 AM, Piotr Krukowiecki <piotr.krukowiecki@gmail.com> wrote: > Hi, > > I compared stating whole tree vs one small subdirectory, and I > expected that for the subdirectory status will be very very fast. > After all, it has only few files to stat. But it's not fast. Why? > > > With cold cache (echo 3 | sudo tee /proc/sys/vm/drop_caches): > > $ time git status > /dev/null > real 0m41.670s > user 0m0.980s > sys 0m2.908s > > $ time git status -- src/.../somedir > /dev/null > real 0m17.380s > user 0m0.748s > sys 0m0.328s > > > With warm cache: > > $ time git status > /dev/null > real 0m0.792s > user 0m0.404s > sys 0m0.384s > > $ time git status -- src/.../somedir > /dev/null > real 0m0.335s > user 0m0.288s > sys 0m0.048s > > > Repository/dir stats: > > $ find * -type f | wc -l > 127189 > $ du -shc * | grep total > 2.2G total > > $ find src/.../somedir -type f | wc -l > 55 > $ du -shc src/.../somedir | grep total > 224K total > > > $ git --version > git version 1.7.9.rc0.10.gbeecc I can't reproduce this behavior at the moment. 'status' on the directory takes about 1.5s instead of 17s. status on whole repository takes 27s. This is my work repository, so it was changed today. I'll get back to you when I can reproduce the problem again... -- Piotr Krukowiecki ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-02-10 16:18 ` Piotr Krukowiecki @ 2012-02-14 11:34 ` Thomas Rast 2012-02-15 8:57 ` Piotr Krukowiecki 0 siblings, 1 reply; 52+ messages in thread From: Thomas Rast @ 2012-02-14 11:34 UTC (permalink / raw) To: Piotr Krukowiecki; +Cc: Git Mailing List, Nguyen Thai Ngoc Duy Piotr Krukowiecki <piotr.krukowiecki@gmail.com> writes: > On Fri, Feb 10, 2012 at 10:42 AM, Piotr Krukowiecki > <piotr.krukowiecki@gmail.com> wrote: >> I compared stating whole tree vs one small subdirectory, and I >> expected that for the subdirectory status will be very very fast. >> After all, it has only few files to stat. But it's not fast. Why? >> >> >> With cold cache (echo 3 | sudo tee /proc/sys/vm/drop_caches): >> >> $ time git status > /dev/null >> real 0m41.670s >> user 0m0.980s >> sys 0m2.908s >> >> $ time git status -- src/.../somedir > /dev/null >> real 0m17.380s >> user 0m0.748s >> sys 0m0.328s [...] > I can't reproduce this behavior at the moment. 'status' on the > directory takes about 1.5s instead of 17s. status on whole repository > takes 27s. > This is my work repository, so it was changed today. To me these timings smell like a combination of either a network filesystem or a slow/busy disk, and non-packed repositories. Next time this happens look at 'git count-objects', run 'git gc' and redo the timings. If you are indeed on a network filesystem, also look at the core.preloadIndex setting. -- Thomas Rast trast@{inf,student}.ethz.ch ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-02-14 11:34 ` Thomas Rast @ 2012-02-15 8:57 ` Piotr Krukowiecki 2012-02-15 11:01 ` Nguyen Thai Ngoc Duy 2012-02-15 19:03 ` Jeff King 0 siblings, 2 replies; 52+ messages in thread From: Piotr Krukowiecki @ 2012-02-15 8:57 UTC (permalink / raw) To: Thomas Rast; +Cc: Git Mailing List, Nguyen Thai Ngoc Duy On Tue, Feb 14, 2012 at 12:34 PM, Thomas Rast <trast@inf.ethz.ch> wrote: > Piotr Krukowiecki <piotr.krukowiecki@gmail.com> writes: > >> On Fri, Feb 10, 2012 at 10:42 AM, Piotr Krukowiecki >> <piotr.krukowiecki@gmail.com> wrote: >>> I compared stating whole tree vs one small subdirectory, and I >>> expected that for the subdirectory status will be very very fast. >>> After all, it has only few files to stat. But it's not fast. Why? >>> >>> >>> With cold cache (echo 3 | sudo tee /proc/sys/vm/drop_caches): >>> >>> $ time git status > /dev/null >>> real 0m41.670s >>> user 0m0.980s >>> sys 0m2.908s >>> >>> $ time git status -- src/.../somedir > /dev/null >>> real 0m17.380s >>> user 0m0.748s >>> sys 0m0.328s > [...] >> I can't reproduce this behavior at the moment. 'status' on the >> directory takes about 1.5s instead of 17s. status on whole repository >> takes 27s. >> This is my work repository, so it was changed today. > > To me these timings smell like a combination of either a network > filesystem or a slow/busy disk, and non-packed repositories. Next time > this happens look at 'git count-objects', run 'git gc' and redo the > timings. > > If you are indeed on a network filesystem, also look at the > core.preloadIndex setting. All is on local disk and system is idle. Indeed, after gc the times went down: 10s -> 2.3s (subdirectory) 17s -> 9.5s (whole repo) 2 seconds is much better and I'd say acceptable for me. But my questions are: - why is it so slow with not packed repo? - can it be faster without repacking? - even with packed repo, the time on small subdirectory is much higher than I'd expect given time on whole repo and subdirectory size - why? $ git count-objects -v count: 5095 size: 37084 in-pack: 755364 packs: 21 size-pack: 2398468 prune-packable: 0 garbage: 0 $ git gc Counting objects: 760212, done. Compressing objects: 100% (158651/158651), done. Writing objects: 100% (760212/760212), done. Total 760212 (delta 535848), reused 736108 (delta 513257) Checking connectivity: 760212, done. $ time git status -- . real 0m2.503s user 0m0.160s sys 0m0.096s $ time git status real 0m9.663s user 0m0.232s sys 0m0.556s -- Piotr Krukowiecki ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-02-15 8:57 ` Piotr Krukowiecki @ 2012-02-15 11:01 ` Nguyen Thai Ngoc Duy 2012-02-15 15:14 ` Piotr Krukowiecki 2012-02-15 19:03 ` Jeff King 1 sibling, 1 reply; 52+ messages in thread From: Nguyen Thai Ngoc Duy @ 2012-02-15 11:01 UTC (permalink / raw) To: Piotr Krukowiecki; +Cc: Thomas Rast, Git Mailing List On Wed, Feb 15, 2012 at 3:57 PM, Piotr Krukowiecki <piotr.krukowiecki@gmail.com> wrote: > Indeed, after gc the times went down: > 10s -> 2.3s (subdirectory) > 17s -> 9.5s (whole repo) > > 2 seconds is much better and I'd say acceptable for me. But my questions are: > - why is it so slow with not packed repo? > - can it be faster without repacking? gc does more than just repacking. If you still have the un-gc'd repo, Try these commands one by one, and time "git status" after each: - git pack-refs --all --prune - git reflog expire --all - git repack -d -l - git prune --expire - git rerere gc I'd be more interested in why auto-gc does not kick in (or whther it should). > - even with packed repo, the time on small subdirectory is much higher > than I'd expect given time on whole repo and subdirectory size - why? Hard to say without measuring. I just notice that I missed your mail with profiling results. I will have a look, but just in case, is the repository publicly available? -- Duy ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-02-15 11:01 ` Nguyen Thai Ngoc Duy @ 2012-02-15 15:14 ` Piotr Krukowiecki 2012-02-16 13:22 ` Piotr Krukowiecki 0 siblings, 1 reply; 52+ messages in thread From: Piotr Krukowiecki @ 2012-02-15 15:14 UTC (permalink / raw) To: Nguyen Thai Ngoc Duy; +Cc: Thomas Rast, Git Mailing List On Wed, Feb 15, 2012 at 12:01 PM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote: > On Wed, Feb 15, 2012 at 3:57 PM, Piotr Krukowiecki > <piotr.krukowiecki@gmail.com> wrote: >> Indeed, after gc the times went down: >> 10s -> 2.3s (subdirectory) >> 17s -> 9.5s (whole repo) >> >> 2 seconds is much better and I'd say acceptable for me. But my questions are: >> - why is it so slow with not packed repo? >> - can it be faster without repacking? > > gc does more than just repacking. If you still have the un-gc'd repo, > Try these commands one by one, and time "git status" after each: > > - git pack-refs --all --prune > - git reflog expire --all > - git repack -d -l > - git prune --expire > - git rerere gc It will take some time but hopefully I'll have the stats for tomorrow. > I'd be more interested in why auto-gc does not kick in (or whther it should). I don't have any specific options set, so default values should be used. I'm using git-svn though, so my workflow looks like this: git svn fetch + git svn rebase ... git operations like commit, cherry-pick, rebase ... git svn dcommit Not sure if that matters. I remember that I've seen auto-gc being run several times in the past - I think after svn fetch/rebase. I'm also using git-new-workdir and have 2 extra workdirs. >> - even with packed repo, the time on small subdirectory is much higher >> than I'd expect given time on whole repo and subdirectory size - why? > > Hard to say without measuring. I just notice that I missed your mail > with profiling results. I will have a look, but just in case, is the > repository publicly available? Unfortunately it's not public. I can do some measuring if someone tells me what to do. -- Piotr Krukowiecki ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-02-15 15:14 ` Piotr Krukowiecki @ 2012-02-16 13:22 ` Piotr Krukowiecki 0 siblings, 0 replies; 52+ messages in thread From: Piotr Krukowiecki @ 2012-02-16 13:22 UTC (permalink / raw) To: Nguyen Thai Ngoc Duy; +Cc: Thomas Rast, Git Mailing List, Jeff King On Wed, Feb 15, 2012 at 4:14 PM, Piotr Krukowiecki <piotr.krukowiecki@gmail.com> wrote: > On Wed, Feb 15, 2012 at 12:01 PM, Nguyen Thai Ngoc Duy > <pclouds@gmail.com> wrote: >> On Wed, Feb 15, 2012 at 3:57 PM, Piotr Krukowiecki >> <piotr.krukowiecki@gmail.com> wrote: >>> Indeed, after gc the times went down: >>> 10s -> 2.3s (subdirectory) >>> 17s -> 9.5s (whole repo) >>> >>> 2 seconds is much better and I'd say acceptable for me. But my questions are: >>> - why is it so slow with not packed repo? >>> - can it be faster without repacking? >> >> gc does more than just repacking. If you still have the un-gc'd repo, >> Try these commands one by one, and time "git status" after each: >> >> - git pack-refs --all --prune >> - git reflog expire --all >> - git repack -d -l >> - git prune --expire >> - git rerere gc > > It will take some time but hopefully I'll have the stats for tomorrow. Here they are. I did 'status' three times to get reliable results and before each run have dropped caches. Backed up repository was copied before each 'status'. Full log is at http://pastebin.com/VmB7J9CJ git version 1.7.9.rc0.10.gbeecc Results after each command: status on whole repo: 18.5s - after git count-objects -v 16.0s - after git pack-refs --all --prune 20.2s - after git reflog expire --all 13.0s - after git repack -d -l 16.8s - after git prune --expire now 19.7s - after git rerere gc status on subdir: 9.7s - after git count-objects -v 9.2s - after git pack-refs --all --prune 9.3s - after git reflog expire --all 4.4s - after git repack -d -l 9.2s - after git prune --expire now 9.0s - after git rerere gc -- Piotr Krukowiecki ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-02-15 8:57 ` Piotr Krukowiecki 2012-02-15 11:01 ` Nguyen Thai Ngoc Duy @ 2012-02-15 19:03 ` Jeff King 2012-02-16 13:37 ` Piotr Krukowiecki 1 sibling, 1 reply; 52+ messages in thread From: Jeff King @ 2012-02-15 19:03 UTC (permalink / raw) To: Piotr Krukowiecki; +Cc: Thomas Rast, Git Mailing List, Nguyen Thai Ngoc Duy On Wed, Feb 15, 2012 at 09:57:29AM +0100, Piotr Krukowiecki wrote: > All is on local disk and system is idle. > > Indeed, after gc the times went down: > 10s -> 2.3s (subdirectory) > 17s -> 9.5s (whole repo) > > 2 seconds is much better and I'd say acceptable for me. But my questions are: Obviously these answers didn't come from any deep analysis, but are educated guesses from me based on previous performance patterns we've seen on the list: > - why is it so slow with not packed repo? Your numbers show that you're I/O-bound: >>> $ time git status > /dev/null >>> real 0m41.670s >>> user 0m0.980s >>> sys 0m2.908s >>> >>> $ time git status -- src/.../somedir > /dev/null >>> real 0m17.380s >>> user 0m0.748s >>> sys 0m0.328s which is not surprising, since you said you dropped caches before-hand. Repacking probably reduced your disk footprint by a lot, which meant less I/O. I notice that you're still I/O bound even after the repack: > $ time git status -- . > real 0m2.503s > user 0m0.160s > sys 0m0.096s > > $ time git status > real 0m9.663s > user 0m0.232s > sys 0m0.556s Did you drop caches here, too? Usually that would not be the case on a warm cache. If it is, then it sounds like you are short on memory to actually hold the directory tree and object db in cache. If not, what do the warm cache numbers look like? > - can it be faster without repacking? Not really. You're showing an I/O problem, and repacking is git's way of reducing I/O. > - even with packed repo, the time on small subdirectory is much higher > than I'd expect given time on whole repo and subdirectory size - why? Hard to say without profiling. It may be that we reduced the object db lookups, saving some time, but still end up stat()ing the whole tree. The optimization to stat only the directories of interest was in 688cd6d (status: only touch path we may need to check, 2010-01-14), which went into v1.7.0. What version of git are you using? -Peff ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-02-15 19:03 ` Jeff King @ 2012-02-16 13:37 ` Piotr Krukowiecki 2012-02-16 14:05 ` Thomas Rast 2012-02-16 19:20 ` Jeff King 0 siblings, 2 replies; 52+ messages in thread From: Piotr Krukowiecki @ 2012-02-16 13:37 UTC (permalink / raw) To: Jeff King; +Cc: Thomas Rast, Git Mailing List, Nguyen Thai Ngoc Duy On Wed, Feb 15, 2012 at 8:03 PM, Jeff King <peff@peff.net> wrote: > On Wed, Feb 15, 2012 at 09:57:29AM +0100, Piotr Krukowiecki wrote: >> > I notice that you're still I/O bound even after the repack: > >> $ time git status -- . >> real 0m2.503s >> user 0m0.160s >> sys 0m0.096s >> >> $ time git status >> real 0m9.663s >> user 0m0.232s >> sys 0m0.556s > > Did you drop caches here, too? Yes I did - with cache the status takes something like 0.1-0.3s on whole repo. > Usually that would not be the case on a > warm cache. If it is, then it sounds like you are short on memory to > actually hold the directory tree and object db in cache. If not, what do > the warm cache numbers look like? I've got 4GB of ram and I did not hit the swap when doing last performance tests AFAIK. Please see my previous posts for performance results with warm cache and profile results: http://article.gmane.org/gmane.comp.version-control.git/190397 http://article.gmane.org/gmane.comp.version-control.git/190638 >> - can it be faster without repacking? > > Not really. You're showing an I/O problem, and repacking is git's way of > reducing I/O. So if I understand correctly, the reason is because git must compare workspace files with packed objects - and the problem is reading/seeking/searching in the packs? Is there a way to make packs better? I think most operations are on workdir files - so maybe it'd be possible to tell gc/repack/whatever to optimize access to files which I currently have in workdir? >> - even with packed repo, the time on small subdirectory is much higher >> than I'd expect given time on whole repo and subdirectory size - why? > > Hard to say without profiling. It may be that we reduced the object db > lookups, saving some time, but still end up stat()ing the whole tree. > The optimization to stat only the directories of interest was in 688cd6d > (status: only touch path we may need to check, 2010-01-14), which went > into v1.7.0. What version of git are you using? For latest tests I've used 1.7.9.rc0.10.gbeecc, for profiling - 1.7.9.188.g12766 Is there anything else I could do? -- Piotr Krukowiecki ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-02-16 13:37 ` Piotr Krukowiecki @ 2012-02-16 14:05 ` Thomas Rast 2012-02-16 20:15 ` Junio C Hamano 2012-02-17 16:55 ` Piotr Krukowiecki 2012-02-16 19:20 ` Jeff King 1 sibling, 2 replies; 52+ messages in thread From: Thomas Rast @ 2012-02-16 14:05 UTC (permalink / raw) To: Piotr Krukowiecki; +Cc: Jeff King, Git Mailing List, Nguyen Thai Ngoc Duy Piotr Krukowiecki <piotr.krukowiecki@gmail.com> writes: > On Wed, Feb 15, 2012 at 8:03 PM, Jeff King <peff@peff.net> wrote: >> On Wed, Feb 15, 2012 at 09:57:29AM +0100, Piotr Krukowiecki wrote: >>> >> I notice that you're still I/O bound even after the repack: >> >>> $ time git status -- . >>> real 0m2.503s >>> user 0m0.160s >>> sys 0m0.096s >>> >>> $ time git status >>> real 0m9.663s >>> user 0m0.232s >>> sys 0m0.556s >> >> Did you drop caches here, too? > > Yes I did - with cache the status takes something like 0.1-0.3s on whole repo. So umm, I'm not sure that leaves anything to be improved. I looked at some strace dumps, and limiting the status to a subdirectory (in my case, '-- t' in git.git) does omit the lstat()s on uninteresting parts of the index-listed files, as well as the getdents() (i.e., readdir()) for parts of the tree that are not interesting. BTW, some other parts of git-status's display may be responsible for the amount of data it pulls from disk. In particular, the "Your branch is ahead" display requires computing the merge-base between HEAD and @{upstream}. If your @{upstream} is way ahead/behind, or points at a disjoint chunk of history, this may mean essentially pulling all of the involved history from disk. If my memory of pack organization serves right, the commit objects involved would essentially be spread across the whole pack (corresponding to "time") and thus this operation would more or less load the entire pack from disk. -- Thomas Rast trast@{inf,student}.ethz.ch ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-02-16 14:05 ` Thomas Rast @ 2012-02-16 20:15 ` Junio C Hamano 2012-02-17 16:55 ` Piotr Krukowiecki 1 sibling, 0 replies; 52+ messages in thread From: Junio C Hamano @ 2012-02-16 20:15 UTC (permalink / raw) To: Thomas Rast Cc: Piotr Krukowiecki, Jeff King, Git Mailing List, Nguyen Thai Ngoc Duy Thomas Rast <trast@inf.ethz.ch> writes: > ... If my memory of pack organization serves > right, the commit objects involved would essentially be spread across > the whole pack No they are crumped together in a contiguous section in a packfile, so that "git log" without any pathspec can go faster without consulting tree objects. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-02-16 14:05 ` Thomas Rast 2012-02-16 20:15 ` Junio C Hamano @ 2012-02-17 16:55 ` Piotr Krukowiecki 1 sibling, 0 replies; 52+ messages in thread From: Piotr Krukowiecki @ 2012-02-17 16:55 UTC (permalink / raw) To: Thomas Rast; +Cc: Jeff King, Git Mailing List, Nguyen Thai Ngoc Duy On Thu, Feb 16, 2012 at 3:05 PM, Thomas Rast <trast@inf.ethz.ch> wrote: > Piotr Krukowiecki <piotr.krukowiecki@gmail.com> writes: > >> On Wed, Feb 15, 2012 at 8:03 PM, Jeff King <peff@peff.net> wrote: >>> On Wed, Feb 15, 2012 at 09:57:29AM +0100, Piotr Krukowiecki wrote: >>>> >>> I notice that you're still I/O bound even after the repack: >>> >>>> $ time git status -- . >>>> real 0m2.503s >>>> user 0m0.160s >>>> sys 0m0.096s >>>> >>>> $ time git status >>>> real 0m9.663s >>>> user 0m0.232s >>>> sys 0m0.556s >>> >>> Did you drop caches here, too? >> >> Yes I did - with cache the status takes something like 0.1-0.3s on whole repo. > > So umm, I'm not sure that leaves anything to be improved. But even with caches time on small directory is only half of time on whole repo: 0.15s vs 0.07s > I looked at some strace dumps, and limiting the status to a subdirectory > (in my case, '-- t' in git.git) does omit the lstat()s on uninteresting > parts of the index-listed files, as well as the getdents() (i.e., > readdir()) for parts of the tree that are not interesting. Same in my case. I've run strace with -c which shows system calls times. As I understand the results the time used by lstat() is very small. With dropped caches: % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 90.70 0.065108 3 25361 12 lstat 6.78 0.004869 1 6534 getdents 1.94 0.001392 0 6964 3238 open 0.27 0.000194 0 3726 close > BTW, some other parts of git-status's display may be responsible for the > amount of data it pulls from disk. In particular, the "Your branch is > ahead" display requires computing the merge-base between HEAD and > @{upstream}. If your @{upstream} is way ahead/behind, or points at a > disjoint chunk of history, this may mean essentially pulling all of the > involved history from disk. If my memory of pack organization serves > right, the commit objects involved would essentially be spread across > the whole pack (corresponding to "time") and thus this operation would > more or less load the entire pack from disk. I don't think this is the case - I'm using git-svn and thus have no upstream in git meaning. -- Piotr Krukowiecki ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-02-16 13:37 ` Piotr Krukowiecki 2012-02-16 14:05 ` Thomas Rast @ 2012-02-16 19:20 ` Jeff King 2012-02-17 17:19 ` Piotr Krukowiecki 1 sibling, 1 reply; 52+ messages in thread From: Jeff King @ 2012-02-16 19:20 UTC (permalink / raw) To: Piotr Krukowiecki; +Cc: Thomas Rast, Git Mailing List, Nguyen Thai Ngoc Duy On Thu, Feb 16, 2012 at 02:37:47PM +0100, Piotr Krukowiecki wrote: > >> $ time git status -- . > >> real 0m2.503s > >> user 0m0.160s > >> sys 0m0.096s > >> > >> $ time git status > >> real 0m9.663s > >> user 0m0.232s > >> sys 0m0.556s > > > > Did you drop caches here, too? > > Yes I did - with cache the status takes something like 0.1-0.3s on whole repo. OK, then that makes sense. It's pretty much just I/O on the filesystem and on the object db. You can break status down a little more to see which is which. Try "git update-index --refresh" to see just how expensive the lstat and index handling is. And then try "git diff-index HEAD" for an idea of how expensive it is to just read the objects and compare to the index. > > Not really. You're showing an I/O problem, and repacking is git's way of > > reducing I/O. > > So if I understand correctly, the reason is because git must compare > workspace files with packed objects - and the problem is > reading/seeking/searching in the packs? Mostly reading (we keep a sorted index and access the packfiles via mmap, so we only touch the pages we need). But you're also paying to lstat() the directory tree, too. And you're paying to load (probably) the whole index into memory, although it's relatively compact compared to the actual file data. > Is there a way to make packs better? I think most operations are on > workdir files - so maybe it'd be possible to tell gc/repack/whatever > to optimize access to files which I currently have in workdir? It already does optimize for that case. If you can make it even better, I'm sure people would be happy to see the numbers. Mostly I think it is just the case that disk I/O is slow, and the operation you're asking for has to do a certain amount of it. What kind of disk/filesystem are you pulling off of? It's not a fuse filesystem by any chance, is it? I have a repo on an encfs-mounted filesystem, and the lstat times are absolutely horrific. -Peff ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-02-16 19:20 ` Jeff King @ 2012-02-17 17:19 ` Piotr Krukowiecki 2012-02-17 20:37 ` Jeff King 0 siblings, 1 reply; 52+ messages in thread From: Piotr Krukowiecki @ 2012-02-17 17:19 UTC (permalink / raw) To: Jeff King; +Cc: Thomas Rast, Git Mailing List, Nguyen Thai Ngoc Duy On Thu, Feb 16, 2012 at 8:20 PM, Jeff King <peff@peff.net> wrote: > On Thu, Feb 16, 2012 at 02:37:47PM +0100, Piotr Krukowiecki wrote: > >> >> $ time git status -- . >> >> real 0m2.503s >> >> user 0m0.160s >> >> sys 0m0.096s >> >> >> >> $ time git status >> >> real 0m9.663s >> >> user 0m0.232s >> >> sys 0m0.556s >> > >> > Did you drop caches here, too? >> >> Yes I did - with cache the status takes something like 0.1-0.3s on whole repo. > > OK, then that makes sense. It's pretty much just I/O on the filesystem > and on the object db. > > You can break status down a little more to see which is which. Try "git > update-index --refresh" to see just how expensive the lstat and index > handling is. "git update-index --refresh" with dropped cache took real 0m3.726s user 0m0.024s sys 0m0.404s while "git status" with dropped cache takes real 0m13.578s user 0m0.240s sys 0m0.600s I'm not sure why it takes more than the 9s reported before - IIRC I did the previous test in single mode under bare shell and this time I'm testing under gnome. This or it's the effect of running update-index :/ Now status on subdir takes 9.5s. But still the not-much-faster-status-on-subdir rule is true. > And then try "git diff-index HEAD" for an idea of how expensive it is to > just read the objects and compare to the index. The diff-index after dropping cache takes real 0m14.095s user 0m0.268s sys 0m0.564s >> > Not really. You're showing an I/O problem, and repacking is git's way of >> > reducing I/O. >> >> So if I understand correctly, the reason is because git must compare >> workspace files with packed objects - and the problem is >> reading/seeking/searching in the packs? > > Mostly reading (we keep a sorted index and access the packfiles via > mmap, so we only touch the pages we need). But you're also paying to > lstat() the directory tree, too. And you're paying to load (probably) > the whole index into memory, although it's relatively compact compared > to the actual file data. If the index is the objects/pack/*.idx files than it's 21MB >> Is there a way to make packs better? I think most operations are on >> workdir files - so maybe it'd be possible to tell gc/repack/whatever >> to optimize access to files which I currently have in workdir? > > It already does optimize for that case. If you can make it even better, > I'm sure people would be happy to see the numbers. If I understand correctly, you only need to compute sha1 on the workdir files and compare it with sha1 files recorded in index/gitdir. It seems that to get the sha1 from index/gitdir I need to read the packfiles? Maybe it'd be possible to cache/index it somehow, for example in separate and smaller file? > Mostly I think it is just the case that disk I/O is slow, and the > operation you're asking for has to do a certain amount of it. What kind > of disk/filesystem are you pulling off of? > > It's not a fuse filesystem by any chance, is it? I have a repo on an > encfs-mounted filesystem, and the lstat times are absolutely horrific. No, it's ext4 and the disk Seagate Barracuda 7200.12 500GB, as it reads on the cover :) But IMO faster disk won't help with this - times will be smaller, but you'll still have to read the same data, so the subdir times will be just 2x faster than whole repo, won't it? So maybe in my case it will go down to e.g. 2s on subdir, but for someone with larger repository it will still be 10s... -- Piotr Krukowiecki ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-02-17 17:19 ` Piotr Krukowiecki @ 2012-02-17 20:37 ` Jeff King 2012-02-17 22:25 ` Junio C Hamano 0 siblings, 1 reply; 52+ messages in thread From: Jeff King @ 2012-02-17 20:37 UTC (permalink / raw) To: Piotr Krukowiecki; +Cc: Thomas Rast, Git Mailing List, Nguyen Thai Ngoc Duy On Fri, Feb 17, 2012 at 06:19:06PM +0100, Piotr Krukowiecki wrote: > "git update-index --refresh" with dropped cache took > real 0m3.726s > user 0m0.024s > sys 0m0.404s > [...] > The diff-index after dropping cache takes > real 0m14.095s > user 0m0.268s > sys 0m0.564s OK, that suggests to me that the real culprit is the I/O we spend in accessing the object db, since that is the main I/O that happens in the second command but not the first. > > Mostly reading (we keep a sorted index and access the packfiles via > > mmap, so we only touch the pages we need). But you're also paying to > > lstat() the directory tree, too. And you're paying to load (probably) > > the whole index into memory, although it's relatively compact compared > > to the actual file data. > > If the index is the objects/pack/*.idx files than it's 21MB Yes, that's it. Though we don't necessarily read the whole thing. The sorted list of sha1s is only a part of that. And we mmap and binary-search that, so we only have to fault in pages that are actually used in our binary search. However, we're faulting in random pages of the index in series, so it may actually have a lot of latency. You can see how expensive the I/O on the index is with something like this: [whole operation, for reference] $ sudo sh -c 'echo 3 >/proc/sys/vm/drop_caches' $ time git diff-index HEAD real 0m2.636s user 0m0.248s sys 0m0.392s [prime the cache with just the index] $ sudo sh -c 'echo 3 >/proc/sys/vm/drop_caches' $ time cat .git/objects/pack/*.idx >/dev/null real 0m0.288s user 0m0.000s sys 0m0.028s $ time git diff-index HEAD real 0m2.175s user 0m0.272s sys 0m0.320s So roughly 20% of the I/O time in my case went to faulting in the index. You could pre-fault in the index, which would give the OS a chance to do read-ahead caching. You can see that the combined cat and diff-index times are still lower than the raw diff-index time. You could also do them in parallel, but that will create some additional seeks as the threads fight for the disk, but may be a win in the long run because we can read bigger chunks. You can roughly simulate it by running the "cat" and the "diff-index" above in parallel. I get: real 0m2.464s user 0m0.284s sys 0m0.372s which is almost exactly the same as doing them separately (though note that this is on an SSD, so seeking is very cheap). But the bulk of the time still goes to actually retrieving the object data, so that's probably a more interesting area to focus, anyway (and if we can reduce object accesses, we reduce their lookup, too :) ). > If I understand correctly, you only need to compute sha1 on the > workdir files and compare it with sha1 files recorded in index/gitdir. > It seems that to get the sha1 from index/gitdir I need to read the > packfiles? Maybe it'd be possible to cache/index it somehow, for > example in separate and smaller file? There are two diffs going on in "git status". One is a comparison between index and worktree. In that one, you need to lstat each file to make sure the cached sha1 we have in the index is up to date. Assuming it is, you don't need to touch the file data at all. Then you compare that sha1 to the stage 0 sha1 (i.e., what we typically think of as "staged for commit"). If they match, you don't need to do more work. But the expensive diff-index we've been doing above is comparing the index to the HEAD tree. And doing that is a little trickier. The index is a flat list of files with their sha1s. But the HEAD tree is stored hierarchically. So to get the sha1 of foo/bar/baz, we have to access the root tree object, find the "foo" entry, access its tree object, find the "bar" entry, access its tree object, and then find the "baz" entry. Then we compare the sha1 of the "baz" entry to what's in the index. So what's where your I/O comes from: accessing each of the tree objects. And that fact that it isn't just "compare the HEAD and index sha1s" is that the index is stored as a list of flat files. That being said, we do have an index extension to store the tree sha1 of whole directories (i.e., we populate it when we write a whole tree or subtree into the index from the object db, and it becomes invalidated when a file becomes modified). This optimization is used by things like "git commit" to avoid having to recreate the same sub-trees over and over when creating tree objects from the index. But we could also use it here to avoid having to even read the sub-tree objects from the object db. > No, it's ext4 and the disk Seagate Barracuda 7200.12 500GB, as it > reads on the cover :) > > But IMO faster disk won't help with this - times will be smaller, but > you'll still have to read the same data, so the subdir times will be > just 2x faster than whole repo, won't it? So maybe in my case it will > go down to e.g. 2s on subdir, but for someone with larger repository > it will still be 10s... Sure. But a certain amount of I/O is going to be unavoidable to get the answer to your question. So you will never be able to achieve the warm-cache case. I'm not saying we can't improve (e.g., I think the index extension thing I mentioned above is a promising approach). But we have to be realistic about what will make things faster; if I/O is your problem, faster disk is one possible solution (especially because some of this is related to seeking and latency, an SSD is a nice improvement for cold-cache times). -Peff ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-02-17 20:37 ` Jeff King @ 2012-02-17 22:25 ` Junio C Hamano 2012-02-17 22:29 ` Jeff King 0 siblings, 1 reply; 52+ messages in thread From: Junio C Hamano @ 2012-02-17 22:25 UTC (permalink / raw) To: Jeff King Cc: Piotr Krukowiecki, Thomas Rast, Git Mailing List, Nguyen Thai Ngoc Duy Jeff King <peff@peff.net> writes: > That being said, we do have an index extension to store the tree sha1 of > whole directories (i.e., we populate it when we write a whole tree or > subtree into the index from the object db, and it becomes invalidated > when a file becomes modified). This optimization is used by things like > "git commit" to avoid having to recreate the same sub-trees over and > over when creating tree objects from the index. But we could also use it > here to avoid having to even read the sub-tree objects from the object > db. Like b65982b (Optimize "diff-index --cached" using cache-tree, 2009-05-20) perhaps? ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-02-17 22:25 ` Junio C Hamano @ 2012-02-17 22:29 ` Jeff King 2012-02-20 8:25 ` Piotr Krukowiecki 0 siblings, 1 reply; 52+ messages in thread From: Jeff King @ 2012-02-17 22:29 UTC (permalink / raw) To: Junio C Hamano Cc: Piotr Krukowiecki, Thomas Rast, Git Mailing List, Nguyen Thai Ngoc Duy On Fri, Feb 17, 2012 at 02:25:25PM -0800, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > > > That being said, we do have an index extension to store the tree sha1 of > > whole directories (i.e., we populate it when we write a whole tree or > > subtree into the index from the object db, and it becomes invalidated > > when a file becomes modified). This optimization is used by things like > > "git commit" to avoid having to recreate the same sub-trees over and > > over when creating tree objects from the index. But we could also use it > > here to avoid having to even read the sub-tree objects from the object > > db. > > Like b65982b (Optimize "diff-index --cached" using cache-tree, 2009-05-20) > perhaps? That's what I get for speaking before running "git log". So yeah, we may be about as reasonably fast as we can go. Or maybe that optimization isn't kicking in for some reason. I think going further would require Piotr to do more profiling. -Peff ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-02-17 22:29 ` Jeff King @ 2012-02-20 8:25 ` Piotr Krukowiecki 2012-02-20 14:06 ` Jeff King 0 siblings, 1 reply; 52+ messages in thread From: Piotr Krukowiecki @ 2012-02-20 8:25 UTC (permalink / raw) To: Jeff King Cc: Junio C Hamano, Thomas Rast, Git Mailing List, Nguyen Thai Ngoc Duy On Fri, Feb 17, 2012 at 11:29 PM, Jeff King <peff@peff.net> wrote: > On Fri, Feb 17, 2012 at 02:25:25PM -0800, Junio C Hamano wrote: > >> Jeff King <peff@peff.net> writes: >> >> > That being said, we do have an index extension to store the tree sha1 of >> > whole directories (i.e., we populate it when we write a whole tree or >> > subtree into the index from the object db, and it becomes invalidated >> > when a file becomes modified). This optimization is used by things like >> > "git commit" to avoid having to recreate the same sub-trees over and >> > over when creating tree objects from the index. But we could also use it >> > here to avoid having to even read the sub-tree objects from the object >> > db. >> >> Like b65982b (Optimize "diff-index --cached" using cache-tree, 2009-05-20) >> perhaps? > > That's what I get for speaking before running "git log". > > So yeah, we may be about as reasonably fast as we can go. Or maybe that > optimization isn't kicking in for some reason. I think going further > would require Piotr to do more profiling. Is the cache set? Not sure how to check it. t0090-cache-tree.sh uses test-dump-cache-tree and executes "read-tree HEAD" to establish the cache, but in my case read-tree does not make the cache dumpable (but it improves status performance). $ test-dump-cache-tree | wc -l 0 $ git read-tree HEAD $ test-dump-cache-tree | wc -l 0 $ echo 3 | sudo tee /proc/sys/vm/drop_caches && time git status -- . [...] real 0m1.085s git version 1.7.9.188.g12766 -- Piotr Krukowiecki ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-02-20 8:25 ` Piotr Krukowiecki @ 2012-02-20 14:06 ` Jeff King 2012-02-20 14:09 ` Thomas Rast ` (2 more replies) 0 siblings, 3 replies; 52+ messages in thread From: Jeff King @ 2012-02-20 14:06 UTC (permalink / raw) To: Piotr Krukowiecki Cc: Junio C Hamano, Thomas Rast, Git Mailing List, Nguyen Thai Ngoc Duy On Mon, Feb 20, 2012 at 09:25:00AM +0100, Piotr Krukowiecki wrote: > Is the cache set? Not sure how to check it. t0090-cache-tree.sh uses > test-dump-cache-tree and executes "read-tree HEAD" to establish the > cache, but in my case read-tree does not make the cache dumpable (but > it improves status performance). > > $ test-dump-cache-tree | wc -l > 0 > $ git read-tree HEAD > $ test-dump-cache-tree | wc -l > 0 > $ echo 3 | sudo tee /proc/sys/vm/drop_caches && time git status -- . > [...] > real 0m1.085s Hmm. I would think test-dump-cache-tree would do it. I don't know why read-tree wouldn't fill it in, though. Interestingly, on my git.git repo, I had an empty cache. Running "git read-tree HEAD" filled it (according to test-dump-cache-tree). It seems that running "git checkout" empties the cache. So perhaps git could do better about keeping the cache valid over time. -Peff ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-02-20 14:06 ` Jeff King @ 2012-02-20 14:09 ` Thomas Rast 2012-02-20 14:36 ` Nguyen Thai Ngoc Duy 2012-02-20 19:57 ` Junio C Hamano 2012-02-20 14:16 ` Nguyen Thai Ngoc Duy 2012-02-20 19:56 ` Junio C Hamano 2 siblings, 2 replies; 52+ messages in thread From: Thomas Rast @ 2012-02-20 14:09 UTC (permalink / raw) To: Jeff King Cc: Piotr Krukowiecki, Junio C Hamano, Git Mailing List, Nguyen Thai Ngoc Duy Jeff King <peff@peff.net> writes: > On Mon, Feb 20, 2012 at 09:25:00AM +0100, Piotr Krukowiecki wrote: > >> Is the cache set? Not sure how to check it. t0090-cache-tree.sh uses >> test-dump-cache-tree and executes "read-tree HEAD" to establish the >> cache, but in my case read-tree does not make the cache dumpable (but >> it improves status performance). >> >> $ test-dump-cache-tree | wc -l >> 0 >> $ git read-tree HEAD >> $ test-dump-cache-tree | wc -l >> 0 >> $ echo 3 | sudo tee /proc/sys/vm/drop_caches && time git status -- . >> [...] >> real 0m1.085s > > Hmm. I would think test-dump-cache-tree would do it. I don't know why > read-tree wouldn't fill it in, though. > > Interestingly, on my git.git repo, I had an empty cache. Running "git > read-tree HEAD" filled it (according to test-dump-cache-tree). It seems > that running "git checkout" empties the cache. So perhaps git could do > better about keeping the cache valid over time. test_expect_failure 'checkout gives cache-tree' ' git checkout HEAD^ && test_shallow_cache_tree ' ;-) -- Thomas Rast trast@{inf,student}.ethz.ch ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-02-20 14:09 ` Thomas Rast @ 2012-02-20 14:36 ` Nguyen Thai Ngoc Duy 2012-02-20 14:39 ` Jeff King 2012-02-21 14:45 ` Nguyen Thai Ngoc Duy 2012-02-20 19:57 ` Junio C Hamano 1 sibling, 2 replies; 52+ messages in thread From: Nguyen Thai Ngoc Duy @ 2012-02-20 14:36 UTC (permalink / raw) To: Thomas Rast Cc: Jeff King, Piotr Krukowiecki, Junio C Hamano, Git Mailing List On Mon, Feb 20, 2012 at 03:09:57PM +0100, Thomas Rast wrote: > > Interestingly, on my git.git repo, I had an empty cache. Running "git > > read-tree HEAD" filled it (according to test-dump-cache-tree). It seems > > that running "git checkout" empties the cache. So perhaps git could do > > better about keeping the cache valid over time. > > test_expect_failure 'checkout gives cache-tree' ' > git checkout HEAD^ && > test_shallow_cache_tree > ' > > ;-) Quick and dirty that passes that test. I think we could do better if we analyse two way merge rules carefully and avoid this diff, but that's too much for me right now. -- 8< -- diff --git a/builtin/checkout.c b/builtin/checkout.c index 5bf96ba..c06287a 100644 --- a/builtin/checkout.c +++ b/builtin/checkout.c @@ -319,6 +319,10 @@ static void show_local_changes(struct object *head, struct diff_options *opts) die(_("diff_setup_done failed")); add_pending_object(&rev, head, NULL); run_diff_index(&rev, 0); + if (!DIFF_OPT_TST(&rev.diffopt, HAS_CHANGES)) { + struct tree *tree = parse_tree_indirect(head->sha1); + prime_cache_tree(&active_cache_tree, tree); + } } static void describe_detached_head(const char *msg, struct commit *commit) @@ -493,13 +497,13 @@ static int merge_working_tree(struct checkout_opts *opts, } } + if (!opts->force && !opts->quiet) + show_local_changes(&new->commit->object, &opts->diff_options); + if (write_cache(newfd, active_cache, active_nr) || commit_locked_index(lock_file)) die(_("unable to write new index file")); - if (!opts->force && !opts->quiet) - show_local_changes(&new->commit->object, &opts->diff_options); - return 0; } -- 8< -- -- Duy ^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-02-20 14:36 ` Nguyen Thai Ngoc Duy @ 2012-02-20 14:39 ` Jeff King 2012-02-20 15:11 ` Jeff King 2012-02-21 14:45 ` Nguyen Thai Ngoc Duy 1 sibling, 1 reply; 52+ messages in thread From: Jeff King @ 2012-02-20 14:39 UTC (permalink / raw) To: Nguyen Thai Ngoc Duy Cc: Thomas Rast, Piotr Krukowiecki, Junio C Hamano, Git Mailing List On Mon, Feb 20, 2012 at 09:36:44PM +0700, Nguyen Thai Ngoc Duy wrote: > > test_expect_failure 'checkout gives cache-tree' ' > > git checkout HEAD^ && > > test_shallow_cache_tree > > ' > > > > ;-) > > Quick and dirty that passes that test. I think we could do better if > we analyse two way merge rules carefully and avoid this diff, but > that's too much for me right now. Unpack trees is already sprinkled with cache_tree_invalidate_path. But something seems to throw away the cache tree entirely (I think it may be that the extension simply isn't copied over to the destination index). I'm walking through it right now. -Peff ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-02-20 14:39 ` Jeff King @ 2012-02-20 15:11 ` Jeff King 2012-02-20 18:45 ` Thomas Rast 2012-02-20 20:08 ` Junio C Hamano 0 siblings, 2 replies; 52+ messages in thread From: Jeff King @ 2012-02-20 15:11 UTC (permalink / raw) To: Nguyen Thai Ngoc Duy Cc: Thomas Rast, Piotr Krukowiecki, Junio C Hamano, Git Mailing List On Mon, Feb 20, 2012 at 09:39:52AM -0500, Jeff King wrote: > On Mon, Feb 20, 2012 at 09:36:44PM +0700, Nguyen Thai Ngoc Duy wrote: > > > > test_expect_failure 'checkout gives cache-tree' ' > > > git checkout HEAD^ && > > > test_shallow_cache_tree > > > ' > > > > > > ;-) > > > > Quick and dirty that passes that test. I think we could do better if > > we analyse two way merge rules carefully and avoid this diff, but > > that's too much for me right now. > > Unpack trees is already sprinkled with cache_tree_invalidate_path. But > something seems to throw away the cache tree entirely (I think it may be > that the extension simply isn't copied over to the destination index). > I'm walking through it right now. Hmm. OK, this doesn't pass the test, but I think it is better than the current behavior. Basically, what happens now with "git checkout" is this: 1. read_cache pulls the cache_tree from disk into the_index 2. we call unpack_trees with o->src_index == o->dst_index == &the_index. 3. during tree traversal, unpack_trees callbacks properly calls cache_tree_invalidate_path whenever it updates a path. We write the results into o->result. 4. At the end of unpack_trees, we forget about src_index, and copy o->result into *o->dst_index byte for byte. I.e., we overwrite the_index.cache_tree, which has been properly updated the whole time, with NULL, dropping it entirely (in fact, I believe it even creates a memory leak of the old cache_tree). This one-liner improves that a bit: diff --git a/unpack-trees.c b/unpack-trees.c index 8be3f6c..e8aedea 100644 --- a/unpack-trees.c +++ b/unpack-trees.c @@ -1135,6 +1135,7 @@ int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options } } + o->result.cache_tree = o->src_index->cache_tree; o->src_index = NULL; ret = check_updates(o) ? (-2) : 0; if (o->dst_index) by copying the cache_tree (which has been updated all along) from src_index into the result (which will then make it into the destination index, which of course in this case is the same as the source index, anyway). It makes "git checkout" with no changes just work (since we preserve the cache tree, and it doesn't need updated). It makes something like "git checkout HEAD^" work, keeping most of the cache-tree intact, but invalidating trees containing paths that were modified. But it does not actually insert the _destination_ tree into the cache tree. Which we can do in certain situations, but only if there were no paths in the tree that were left unchanged (e.g., you modify "foo", then "git checkout HEAD^", which updates "bar". Your tree does not match HEAD^, and must be invalidated). While it would be cool to be able to handle those complex cases, making this one simple change covers most of the cases people care about (i.e., leaving the cache-tree intact for all of the directories that weren't touched at all). I think this implementation matches the intent of the original calls to cache_tree_invalidate_path sprinkled throughout unpack-trees.c. But I have to say that it seems a little odd for us to be modifying the o->src_index throughout the whole thing. I would think the right thing would be to make a deep copy of o->src_index->cache_tree into o->result.cache_tree as the very first thing, and then update o->result.cache_tree throughout the tree traversal. There is no point in invalidating src_index's cache_tree, since it is not receiving the updates. In practice, this doesn't tend to matter because everybody just sets src and dst to &the_index anyway. The one exception seems to be diff_cache, which sets dst_index to NULL. But it doesn't matter there, because we are only using oneway_diff as our callback, which does not actually write or invalidate anything in the cache. -Peff ^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-02-20 15:11 ` Jeff King @ 2012-02-20 18:45 ` Thomas Rast 2012-02-20 20:35 ` Jeff King 2012-02-20 20:08 ` Junio C Hamano 1 sibling, 1 reply; 52+ messages in thread From: Thomas Rast @ 2012-02-20 18:45 UTC (permalink / raw) To: Jeff King Cc: Nguyen Thai Ngoc Duy, Piotr Krukowiecki, Junio C Hamano, Git Mailing List Jeff King <peff@peff.net> writes: > diff --git a/unpack-trees.c b/unpack-trees.c > index 8be3f6c..e8aedea 100644 > --- a/unpack-trees.c > +++ b/unpack-trees.c > @@ -1135,6 +1135,7 @@ int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options > } > } > > + o->result.cache_tree = o->src_index->cache_tree; > o->src_index = NULL; > ret = check_updates(o) ? (-2) : 0; > if (o->dst_index) Brilliant. I know I'm stealing Junio's punchline, but please make it so :-) Browsing around in history, it seems that this was silently broken by 34110cd (Make 'unpack_trees()' have a separate source and destination index, 2008-03-06), which introduced the distinction between source and destination index. Before that they were the same, so the cache tree would have been updated correctly. > It makes "git checkout" with no changes just work (since we preserve the > cache tree, and it doesn't need updated). It makes something like "git > checkout HEAD^" work, keeping most of the cache-tree intact, but > invalidating trees containing paths that were modified. Great. Here's a test you could use. It's a bit noisy because the shallow in test_shallow_cache_tree no longer made any sense, but I think it tests what we want to see. diff --git i/t/t0090-cache-tree.sh w/t/t0090-cache-tree.sh index 6c33e28..5706305 100755 --- i/t/t0090-cache-tree.sh +++ w/t/t0090-cache-tree.sh @@ -16,14 +16,16 @@ cmp_cache_tree () { # We don't bother with actually checking the SHA1: # test-dump-cache-tree already verifies that all existing data is # correct. -test_shallow_cache_tree () { - printf "SHA (%d entries, 0 subtrees)\n" $(git ls-files|wc -l) >expect && +test_cache_tree () { + printf "SHA (%d entries, 1 subtrees)\n" $(git ls-files|wc -l) >expect && + printf "SHA sub/ (%d entries, 0 subtrees)\n" $(git ls-files sub|wc -l) >>expect && cmp_cache_tree expect } test_invalid_cache_tree () { - echo "invalid (0 subtrees)" >expect && - printf "SHA #(ref) (%d entries, 0 subtrees)\n" $(git ls-files|wc -l) >>expect && + echo "invalid (1 subtrees)" >expect && + printf "SHA #(ref) (%d entries, 1 subtrees)\n" $(git ls-files|wc -l) >>expect && + printf "SHA sub/ (%d entries, 0 subtrees)\n" $(git ls-files sub|wc -l) >>expect && cmp_cache_tree expect } @@ -33,13 +35,16 @@ test_no_cache_tree () { } test_expect_failure 'initial commit has cache-tree' ' + mkdir sub && + echo bar > sub/bar && + git add sub/bar && test_commit foo && - test_shallow_cache_tree + test_cache_tree ' test_expect_success 'read-tree HEAD establishes cache-tree' ' git read-tree HEAD && - test_shallow_cache_tree + test_cache_tree ' test_expect_success 'git-add invalidates cache-tree' ' @@ -59,7 +64,7 @@ test_expect_success 'update-index invalidates cache-tree' ' test_expect_success 'write-tree establishes cache-tree' ' test-scrap-cache-tree && git write-tree && - test_shallow_cache_tree + test_cache_tree ' test_expect_success 'test-scrap-cache-tree works' ' @@ -70,24 +75,39 @@ test_expect_success 'test-scrap-cache-tree works' ' test_expect_success 'second commit has cache-tree' ' test_commit bar && - test_shallow_cache_tree + test_cache_tree ' test_expect_success 'reset --hard gives cache-tree' ' test-scrap-cache-tree && git reset --hard && - test_shallow_cache_tree + test_cache_tree ' test_expect_success 'reset --hard without index gives cache-tree' ' rm -f .git/index && git reset --hard && - test_shallow_cache_tree + test_cache_tree ' -test_expect_failure 'checkout gives cache-tree' ' +test_expect_success 'checkout HEAD leaves cache-tree intact' ' + git read-tree HEAD && + git checkout HEAD && + test_cache_tree +' + +# NEEDSWORK: only one of these two can succeed. The second is there +# because it would be the better result. +test_expect_success 'checkout HEAD^ correctly invalidates cache-tree' ' + git checkout HEAD^ && + test_invalid_cache_tree +' + +test_expect_failure 'checkout HEAD^ gives full cache-tree' ' + git checkout master && + git read-tree HEAD && git checkout HEAD^ && - test_shallow_cache_tree + test_cache_tree ' test_done -- Thomas Rast trast@{inf,student}.ethz.ch ^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-02-20 18:45 ` Thomas Rast @ 2012-02-20 20:35 ` Jeff King 2012-02-20 22:04 ` Junio C Hamano 0 siblings, 1 reply; 52+ messages in thread From: Jeff King @ 2012-02-20 20:35 UTC (permalink / raw) To: Thomas Rast Cc: Nguyen Thai Ngoc Duy, Piotr Krukowiecki, Junio C Hamano, Git Mailing List On Mon, Feb 20, 2012 at 07:45:59PM +0100, Thomas Rast wrote: > > + o->result.cache_tree = o->src_index->cache_tree; > > o->src_index = NULL; > > ret = check_updates(o) ? (-2) : 0; > > if (o->dst_index) > > Brilliant. I know I'm stealing Junio's punchline, but please make it so > :-) > > Browsing around in history, it seems that this was silently broken by > 34110cd (Make 'unpack_trees()' have a separate source and destination > index, 2008-03-06), which introduced the distinction between source and > destination index. Before that they were the same, so the cache tree > would have been updated correctly. OK, good. When you write a one-liner that makes a huge change in performance, it is usually a good idea to think to yourself "no, it couldn't be this easy, could it?". But after more discussion from people more clueful than I (this is the first time I've even looked at cache-tree code), I'm feeling like this is the right direction, at least, if not exactly the right patch. And seeing that it is in fact a regression in 34110cd, and that the existing cache-tree invalidations predate that makes me feel better. At one point, at least, they were complete and we were depending on them to be accurate. Things may have changed since then, of course, but I at least know that they were sufficient in 34110cd^. > +# NEEDSWORK: only one of these two can succeed. The second is there > +# because it would be the better result. > +test_expect_success 'checkout HEAD^ correctly invalidates cache-tree' ' > + git checkout HEAD^ && > + test_invalid_cache_tree > +' > + > +test_expect_failure 'checkout HEAD^ gives full cache-tree' ' > + git checkout master && > + git read-tree HEAD && > git checkout HEAD^ && > - test_shallow_cache_tree > + test_cache_tree > ' I think you can construct two tests that will both work in the "ideal" case. In the first one, you move to a tree that updates "foo", and therefore the root cache-tree is invalidated. In the second, you update "subdir1/foo" in the index, then move to a commit that differs in "subdir1/bar" and "subdir2/bar". You should see that subdir2 has the cache-tree of the destination commit, but that subdir1 is invalidated (and therefore the root is also invalidated). That will fail with my patch, of course, as it would invalidate subdir2, also; so it would just be an expect_failure for somebody in the future. In general, t0090 could benefit from using a larger tree. For example, the add test does "git add foo" and checks that the root cache-tree was invalidated. But it should _also_ check that the cache-tree for a subdirectory is _not_ invalidated (and it isn't; git-add does the right thing). I'll see if I can work up some fancier tests, too. -Peff ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-02-20 20:35 ` Jeff King @ 2012-02-20 22:04 ` Junio C Hamano 2012-02-20 22:41 ` Jeff King 0 siblings, 1 reply; 52+ messages in thread From: Junio C Hamano @ 2012-02-20 22:04 UTC (permalink / raw) To: Jeff King Cc: Thomas Rast, Nguyen Thai Ngoc Duy, Piotr Krukowiecki, Git Mailing List Jeff King <peff@peff.net> writes: > ... Things may have changed since then, of course, but I at > least know that they were sufficient in 34110cd^. Looking at where cache_tree_free() is called, I think back then the two-way merge was deemed OK, but we did not trust three-way merge or merge-recursive at all. > I think you can construct two tests that will both work in the "ideal" > case. In the first one, you move to a tree that updates "foo", and > therefore the root cache-tree is invalidated. I have to warn you that under-invalidation of cache-tree is extremely hard to find. The only way I know, which I had to resort to when dealing with a handful of instances of under-invalidation bugs, is to run write-tree with potentially corrupt cache-tree and then using the same index but with the cache-tree purged, run another write-tree and check to see if two trees match. > In general, t0090 could benefit from using a larger tree. For example, > the add test does "git add foo" and checks that the root cache-tree was > invalidated. But it should _also_ check that the cache-tree for a > subdirectory is _not_ invalidated (and it isn't; git-add does the right > thing). It is OK to check that we do not over-invalidate for performance, but it is a lot more important to make sure we do not under-invalidate for correctness. I am a bit worried that you seem to be putting more stress on the former. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-02-20 22:04 ` Junio C Hamano @ 2012-02-20 22:41 ` Jeff King 2012-02-20 23:31 ` Junio C Hamano 0 siblings, 1 reply; 52+ messages in thread From: Jeff King @ 2012-02-20 22:41 UTC (permalink / raw) To: Junio C Hamano Cc: Thomas Rast, Nguyen Thai Ngoc Duy, Piotr Krukowiecki, Git Mailing List On Mon, Feb 20, 2012 at 02:04:26PM -0800, Junio C Hamano wrote: > > ... Things may have changed since then, of course, but I at > > least know that they were sufficient in 34110cd^. > > Looking at where cache_tree_free() is called, I think back then the > two-way merge was deemed OK, but we did not trust three-way merge or > merge-recursive at all. Thanks, I'll take a look more closely at those cases. > It is OK to check that we do not over-invalidate for performance, but it > is a lot more important to make sure we do not under-invalidate for > correctness. I am a bit worried that you seem to be putting more stress > on the former. I think it is just selection bias of the specific parts of his tests that I was responding to. I completely agree that correctness is way more important, and I'm also trying to come up with tests to validate correctness. I just wasn't talking about them there. I still think replaying real-world test cases is going to be more likely to find issues in invalidation. I can come up with lots of simple test-cases, but they're not likely to find anything we wouldn't find in the code with trivial inspection. I think a combination of careful analysis and real-world validation is going to be more helpful in the long run than the kind of simplistic tests that are in t0090. -Peff ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-02-20 22:41 ` Jeff King @ 2012-02-20 23:31 ` Junio C Hamano 2012-02-21 7:21 ` Piotr Krukowiecki 0 siblings, 1 reply; 52+ messages in thread From: Junio C Hamano @ 2012-02-20 23:31 UTC (permalink / raw) To: Jeff King Cc: Thomas Rast, Nguyen Thai Ngoc Duy, Piotr Krukowiecki, Git Mailing List Jeff King <peff@peff.net> writes: > I still think replaying real-world test cases is going to be more likely > to find issues in invalidation. Yes, but it depends on what kind of replaying you have in mind. It is very hard to come up with "replaying real-world test case". For example, randomly picking commit pair <$A,$B> from the kernel repository and running git reset --hard $A git checkout $B T0=$(git write-tree) drop-cache-tree T1=$(git write-tree) test "$T0" = "$T1" && test "$T0" = $(git rev-parse $B^{tree}) is necessary but I do not think that is sufficient. We also want to do something like: git reset --hard $A ... modify paths that do not change between $A and $B ... git add these paths git write-tree git checkout $B with and without the "git write-tree" to see the part of the cache-tree smudged by the modification behaves sanely. The codepath that is used to deal with the case where the index does not match $A but matches $B is also different, so the "modify path and git add" step would have to be crafted carefully to touch all bases. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-02-20 23:31 ` Junio C Hamano @ 2012-02-21 7:21 ` Piotr Krukowiecki 0 siblings, 0 replies; 52+ messages in thread From: Piotr Krukowiecki @ 2012-02-21 7:21 UTC (permalink / raw) To: Junio C Hamano Cc: Jeff King, Thomas Rast, Nguyen Thai Ngoc Duy, Git Mailing List On Tue, Feb 21, 2012 at 12:31 AM, Junio C Hamano <gitster@pobox.com> wrote: > Jeff King <peff@peff.net> writes: [...] Hi, I hope the fixes will also help git-svn users? I.e. there's a lot of rebasing (and cherry-picking - at least in my case) and probably some other stuff going on under hood. I.e. I hope that if I git-svn rebase or cherry-pick, it won't invalidate the cache... Thanks, -- Piotr Krukowiecki ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-02-20 15:11 ` Jeff King 2012-02-20 18:45 ` Thomas Rast @ 2012-02-20 20:08 ` Junio C Hamano 2012-02-20 20:17 ` Jeff King 1 sibling, 1 reply; 52+ messages in thread From: Junio C Hamano @ 2012-02-20 20:08 UTC (permalink / raw) To: Jeff King Cc: Nguyen Thai Ngoc Duy, Thomas Rast, Piotr Krukowiecki, Junio C Hamano, Git Mailing List Jeff King <peff@peff.net> writes: > 4. At the end of unpack_trees, we forget about src_index, and copy > o->result into *o->dst_index byte for byte. I.e., we overwrite > the_index.cache_tree, which has been properly updated the whole > time, I strongly suspect that "properly updated" part needs to be thoroughly audited. I wouldn't be surprised that this behaviour is what we did when we split src_index vs dst_index when he rewrote unpack_trees() in order to emulate the original "unpack-trees is beyond salvation because it does not maintain cache tree correctly, just nuke it" behaviour. > But it does not actually insert the _destination_ tree into the cache > tree. Which we can do in certain situations, but only if there were no > paths in the tree that were left unchanged (e.g., you modify "foo", then > "git checkout HEAD^", which updates "bar". Your tree does not match > HEAD^, and must be invalidated). While it would be cool to be able to > handle those complex cases,... It may look cool but it may not be a good change. You are spending extra cycles to optimize for the next write-tree that may not happen before the index is further updated. > I think this implementation matches the intent of the original calls to > cache_tree_invalidate_path sprinkled throughout unpack-trees.c. Yes, and as long as we invalidate all the directories that need to be invalidated during the unpack-tree operation, I think it is a correct thing to do. > But I > have to say that it seems a little odd for us to be modifying the > o->src_index throughout the whole thing. Yes, that part is logically *wrong*. I think it is a remnant from the days when there was no distinction between src_index and dst_index. > I would think the right thing > would be to make a deep copy of o->src_index->cache_tree into > o->result.cache_tree as the very first thing, and then update > o->result.cache_tree throughout the tree traversal. Yes. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-02-20 20:08 ` Junio C Hamano @ 2012-02-20 20:17 ` Jeff King 0 siblings, 0 replies; 52+ messages in thread From: Jeff King @ 2012-02-20 20:17 UTC (permalink / raw) To: Junio C Hamano Cc: Nguyen Thai Ngoc Duy, Thomas Rast, Piotr Krukowiecki, Git Mailing List On Mon, Feb 20, 2012 at 12:08:03PM -0800, Junio C Hamano wrote: > > 4. At the end of unpack_trees, we forget about src_index, and copy > > o->result into *o->dst_index byte for byte. I.e., we overwrite > > the_index.cache_tree, which has been properly updated the whole > > time, > > I strongly suspect that "properly updated" part needs to be thoroughly > audited. I wouldn't be surprised that this behaviour is what we did when > we split src_index vs dst_index when he rewrote unpack_trees() in order to > emulate the original "unpack-trees is beyond salvation because it does not > maintain cache tree correctly, just nuke it" behaviour. Yep, I am also concerned about that. > > But it does not actually insert the _destination_ tree into the cache > > tree. Which we can do in certain situations, but only if there were no > > paths in the tree that were left unchanged (e.g., you modify "foo", then > > "git checkout HEAD^", which updates "bar". Your tree does not match > > HEAD^, and must be invalidated). While it would be cool to be able to > > handle those complex cases,... > > It may look cool but it may not be a good change. You are spending extra > cycles to optimize for the next write-tree that may not happen before the > index is further updated. I don't think it would be too many cycles; you would have to mark each tree you enter as having items from the left-hand tree or the right-hand tree. If only one, you can reuse the cache-tree entry (or tree sha1, if coming from a tree). Otherwise, you must invalidate. And it doesn't just help the next write-tree, but any intermediate index diffs. Of course any such change would need timings to justify it, though. That being said, I think just invalidating really covers 99% of the cases. What we really care about is that when I modify kernel/foo.c, the ~2300 other directories (besides "" and "kernel") don't need rebuilt, and that is relatively simple to do. Even if doing it the other way produced a tiny speedup, I would be concerned with the increase in code complexity. > > I think this implementation matches the intent of the original calls to > > cache_tree_invalidate_path sprinkled throughout unpack-trees.c. > > Yes, and as long as we invalidate all the directories that need to be > invalidated during the unpack-tree operation, I think it is a correct > thing to do. OK. I'll do some reading of the code to convince myself that the unpack_trees callbacks are doing the right thing. I'm not sure of a good automatic test that would detect a failure there. Just making test cases is going to end up too contrived, unless we are missing something really obvious. I'm thinking maybe something like replaying the commit history of linux-2.6 and making sure that each the tree generated by the cache-tree in each case matches the actual committed tree. > > But I > > have to say that it seems a little odd for us to be modifying the > > o->src_index throughout the whole thing. > > Yes, that part is logically *wrong*. I think it is a remnant from the > days when there was no distinction between src_index and dst_index. OK. I'll include a fix for that in the series I prepare. -Peff ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-02-20 14:36 ` Nguyen Thai Ngoc Duy 2012-02-20 14:39 ` Jeff King @ 2012-02-21 14:45 ` Nguyen Thai Ngoc Duy 2012-02-21 19:16 ` Junio C Hamano 2012-02-22 3:32 ` Junio C Hamano 1 sibling, 2 replies; 52+ messages in thread From: Nguyen Thai Ngoc Duy @ 2012-02-21 14:45 UTC (permalink / raw) To: Junio C Hamano Cc: Thomas Rast, Jeff King, Piotr Krukowiecki, Git Mailing List On Mon, Feb 20, 2012 at 9:36 PM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote: > On Mon, Feb 20, 2012 at 03:09:57PM +0100, Thomas Rast wrote: >> > Interestingly, on my git.git repo, I had an empty cache. Running "git >> > read-tree HEAD" filled it (according to test-dump-cache-tree). It seems >> > that running "git checkout" empties the cache. So perhaps git could do >> > better about keeping the cache valid over time. >> >> test_expect_failure 'checkout gives cache-tree' ' >> git checkout HEAD^ && >> test_shallow_cache_tree >> ' >> >> ;-) > > Quick and dirty that passes that test. I'm aware that Jeff's tackling at lower level, which retains cache-tree for many more cases. But this patch seems simple and safe to me, and in my experience this case happens quite often (or maybe I tend to keep my index clean). Junio, any chance this patch may get in? > -- 8< -- > diff --git a/builtin/checkout.c b/builtin/checkout.c > index 5bf96ba..c06287a 100644 > --- a/builtin/checkout.c > +++ b/builtin/checkout.c > @@ -319,6 +319,10 @@ static void show_local_changes(struct object *head, struct diff_options *opts) > die(_("diff_setup_done failed")); > add_pending_object(&rev, head, NULL); > run_diff_index(&rev, 0); > + if (!DIFF_OPT_TST(&rev.diffopt, HAS_CHANGES)) { > + struct tree *tree = parse_tree_indirect(head->sha1); > + prime_cache_tree(&active_cache_tree, tree); > + } > } > > static void describe_detached_head(const char *msg, struct commit *commit) > @@ -493,13 +497,13 @@ static int merge_working_tree(struct checkout_opts *opts, > } > } > > + if (!opts->force && !opts->quiet) > + show_local_changes(&new->commit->object, &opts->diff_options); > + > if (write_cache(newfd, active_cache, active_nr) || > commit_locked_index(lock_file)) > die(_("unable to write new index file")); > > - if (!opts->force && !opts->quiet) > - show_local_changes(&new->commit->object, &opts->diff_options); > - > return 0; > } > > -- 8< -- > -- > Duy -- Duy ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-02-21 14:45 ` Nguyen Thai Ngoc Duy @ 2012-02-21 19:16 ` Junio C Hamano 2012-02-22 2:12 ` Nguyen Thai Ngoc Duy 2012-02-22 10:34 ` Nguyen Thai Ngoc Duy 2012-02-22 3:32 ` Junio C Hamano 1 sibling, 2 replies; 52+ messages in thread From: Junio C Hamano @ 2012-02-21 19:16 UTC (permalink / raw) To: Nguyen Thai Ngoc Duy Cc: Thomas Rast, Jeff King, Piotr Krukowiecki, Git Mailing List Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes: > I'm aware that Jeff's tackling at lower level, which retains > cache-tree for many more cases. > > But this patch seems simple and safe > to me, and in my experience this case happens quite often (or maybe I > tend to keep my index clean). Junio, any chance this patch may get in? I do not think we are talking about a duplicated effort here. By definition, the change to hook into unpack_trees() and making sure we invalidate all the necessary subtrees in the cache cannot give you a cache tree that is more populated than what you started with. And the train of thought in Peff's message is to improve this invalidation---we currently invalidate everything ;-) Somebody has to populate the cache tree fully when we _know_ the index matches a certain tree, and adding a call to prime_cache_tree() in strategic places is a way to do so. The most obvious is write-tree, but there are a few other existing codepaths that do so. Because prime_cache_tree() by itself is a fairly expensive operation that reads all the trees recursively, its benefits need to be evaluated. It should to happen only in an operation that is already heavy-weight, is likely to have read all the trees and have many of them in-core cache, and also relatively rarely happens compared to "git add" so that the cost can be amortised over time, such as "reset --(hard|mixed)". Switching branches is likely to fall into that category, but that is just my gut feeling. I would feel better at night if somebody did a benchmark ;-) One thing we do not currently do anywhere that _might_ be of merit is to make a call to cache_tree_update() instead of prime_cache_tree() when we already know that only a very small subpart of the cache-tree is invalid and it is cheaper to repair it by rehashing only a small portion of the index than to re-prime the entire cache tree with prime_cache_tree(). ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-02-21 19:16 ` Junio C Hamano @ 2012-02-22 2:12 ` Nguyen Thai Ngoc Duy 2012-02-22 2:55 ` Junio C Hamano 2012-02-22 10:34 ` Nguyen Thai Ngoc Duy 1 sibling, 1 reply; 52+ messages in thread From: Nguyen Thai Ngoc Duy @ 2012-02-22 2:12 UTC (permalink / raw) To: Junio C Hamano Cc: Thomas Rast, Jeff King, Piotr Krukowiecki, Git Mailing List On Wed, Feb 22, 2012 at 2:16 AM, Junio C Hamano <gitster@pobox.com> wrote: > Because prime_cache_tree() by itself is a fairly expensive operation that > reads all the trees recursively, its benefits need to be evaluated. It > should to happen only in an operation that is already heavy-weight, is > likely to have read all the trees and have many of them in-core cache, and > also relatively rarely happens compared to "git add" so that the cost can > be amortised over time, such as "reset --(hard|mixed)". > > Switching branches is likely to fall into that category, but that is just > my gut feeling. I would feel better at night if somebody did a benchmark > ;-) In this particular case, "git diff --cached" is run internally, so I say all trees are read once and hopefully most of them still in OS cache. Will run some benchmark, maybe with the coming perf test suite. > One thing we do not currently do anywhere that _might_ be of merit is to > make a call to cache_tree_update() instead of prime_cache_tree() when we > already know that only a very small subpart of the cache-tree is invalid > and it is cheaper to repair it by rehashing only a small portion of the > index than to re-prime the entire cache tree with prime_cache_tree(). That makes me think if "diff --cached" can take advantage of cache-tree to avoid walking down valid cached trees and do tree-tree diff in those cases instead. Not sure if it gains us anything but code complexity. -- Duy ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-02-22 2:12 ` Nguyen Thai Ngoc Duy @ 2012-02-22 2:55 ` Junio C Hamano 2012-02-22 12:54 ` Nguyen Thai Ngoc Duy 0 siblings, 1 reply; 52+ messages in thread From: Junio C Hamano @ 2012-02-22 2:55 UTC (permalink / raw) To: Nguyen Thai Ngoc Duy Cc: Thomas Rast, Jeff King, Piotr Krukowiecki, Git Mailing List Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes: > That makes me think if "diff --cached" can take advantage of > cache-tree to avoid walking down valid cached trees and do tree-tree > diff in those cases instead. Not sure if it gains us anything but code > complexity. Why do I have this funny feeling that we saw that comment in this thread already? ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-02-22 2:55 ` Junio C Hamano @ 2012-02-22 12:54 ` Nguyen Thai Ngoc Duy 2012-02-22 13:17 ` Thomas Rast 0 siblings, 1 reply; 52+ messages in thread From: Nguyen Thai Ngoc Duy @ 2012-02-22 12:54 UTC (permalink / raw) To: Junio C Hamano Cc: Thomas Rast, Jeff King, Piotr Krukowiecki, Git Mailing List On Wed, Feb 22, 2012 at 9:55 AM, Junio C Hamano <gitster@pobox.com> wrote: > Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes: > >> That makes me think if "diff --cached" can take advantage of >> cache-tree to avoid walking down valid cached trees and do tree-tree >> diff in those cases instead. Not sure if it gains us anything but code >> complexity. > > Why do I have this funny feeling that we saw that comment in this thread > already? Simple. You wrote it and I missed it. On Sat, Feb 18, 2012 at 5:25 AM, Junio C Hamano <gitster@pobox.com> wrote: > Jeff King <peff@peff.net> writes: > >> That being said, we do have an index extension to store the tree sha1 of >> whole directories (i.e., we populate it when we write a whole tree or >> subtree into the index from the object db, and it becomes invalidated >> when a file becomes modified). This optimization is used by things like >> "git commit" to avoid having to recreate the same sub-trees over and >> over when creating tree objects from the index. But we could also use it >> here to avoid having to even read the sub-tree objects from the object >> db. > > Like b65982b (Optimize "diff-index --cached" using cache-tree, 2009-05-20) > perhaps? This optimizes the case when a cached tree matches entirely.I wonder whether it's faster if we switch to tree-tree diff whenever we find valid cached trees. If cache-tree is fully valid, "git diff --cached foo" would be equivalent to "git diff HEAD foo". I tried "git diff --raw HEAD HEAD~100" (where HEAD was v3.1-rc1-272-g73e0881 on linux-2.6) and "git diff --cached --raw HEAD~100" with no cache-tree. The former is a little bit faster than the latter (177ms vs 275ms). On gentoo-x86, 70k worktree files, it's 4.33s vs 4.45s. But in tree-tree diff we pay high in cold cache case for loading trees from "HEAD". So no, probably not worth more code changes. Your optimization is good enough. -- Duy ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-02-22 12:54 ` Nguyen Thai Ngoc Duy @ 2012-02-22 13:17 ` Thomas Rast 0 siblings, 0 replies; 52+ messages in thread From: Thomas Rast @ 2012-02-22 13:17 UTC (permalink / raw) To: Nguyen Thai Ngoc Duy Cc: Junio C Hamano, Jeff King, Piotr Krukowiecki, Git Mailing List Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes: > On Sat, Feb 18, 2012 at 5:25 AM, Junio C Hamano <gitster@pobox.com> wrote: >> Jeff King <peff@peff.net> writes: >> >>> That being said, we do have an index extension to store the tree sha1 of >>> whole directories (i.e., we populate it when we write a whole tree or >>> subtree into the index from the object db, and it becomes invalidated >>> when a file becomes modified). This optimization is used by things like >>> "git commit" to avoid having to recreate the same sub-trees over and >>> over when creating tree objects from the index. But we could also use it >>> here to avoid having to even read the sub-tree objects from the object >>> db. >> >> Like b65982b (Optimize "diff-index --cached" using cache-tree, 2009-05-20) >> perhaps? > > This optimizes the case when a cached tree matches entirely.I wonder > whether it's faster if we switch to tree-tree diff whenever we find > valid cached trees. If cache-tree is fully valid, "git diff --cached > foo" would be equivalent to "git diff HEAD foo". Not necessarily; the cache-tree is valid if it faithfully represents what is in the index. It does not have any direct relation to HEAD. > I tried "git diff --raw HEAD HEAD~100" (where HEAD was > v3.1-rc1-272-g73e0881 on linux-2.6) and "git diff --cached --raw > HEAD~100" with no cache-tree. The former is a little bit faster than > the latter (177ms vs 275ms). On gentoo-x86, 70k worktree files, it's > 4.33s vs 4.45s. But in tree-tree diff we pay high in cold cache case > for loading trees from "HEAD". So no, probably not worth more code > changes. Your optimization is good enough. I'm still wondering about using mincore() to good effect. I tried it for git-grep, but it ended up slowing things down. However, it irks me that in some cases a clueful use of one form over the other can really make a huge performance difference, e.g., git grep stuff git grep HEAD stuff If I am in a big repository that I haven't used in a while, the HEAD form will be much faster as the worktree search would fault many files. OTOH if I am in a heavily-used repository (and perhaps just said 'make' minutes ago) the worktree version will avoid the pack decompression effort. Sadly this also has the problem that we must first determine whether substituting HEAD for the worktree (or vice versa) is valid at all. For grep perhaps there could be a "just do a fast search somewhere" option since usually you are looking for something that hasn't changed in ages. Ok, that was almost completely beside the point of this thread. -- Thomas Rast trast@{inf,student}.ethz.ch ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-02-21 19:16 ` Junio C Hamano 2012-02-22 2:12 ` Nguyen Thai Ngoc Duy @ 2012-02-22 10:34 ` Nguyen Thai Ngoc Duy 1 sibling, 0 replies; 52+ messages in thread From: Nguyen Thai Ngoc Duy @ 2012-02-22 10:34 UTC (permalink / raw) To: Junio C Hamano Cc: Thomas Rast, Jeff King, Piotr Krukowiecki, Git Mailing List On Tue, Feb 21, 2012 at 11:16:37AM -0800, Junio C Hamano wrote: > Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes: > > > I'm aware that Jeff's tackling at lower level, which retains > > cache-tree for many more cases. > > > > But this patch seems simple and safe > > to me, and in my experience this case happens quite often (or maybe I > > tend to keep my index clean). Junio, any chance this patch may get in? > > I do not think we are talking about a duplicated effort here. > > By definition, the change to hook into unpack_trees() and making sure we > invalidate all the necessary subtrees in the cache cannot give you a cache > tree that is more populated than what you started with. And the train of > thought in Peff's message is to improve this invalidation---we currently > invalidate everything ;-) > > Somebody has to populate the cache tree fully when we _know_ the index > matches a certain tree, and adding a call to prime_cache_tree() in > strategic places is a way to do so. The most obvious is write-tree, but > there are a few other existing codepaths that do so. > > Because prime_cache_tree() by itself is a fairly expensive operation that > reads all the trees recursively, its benefits need to be evaluated. It > should to happen only in an operation that is already heavy-weight, is > likely to have read all the trees and have many of them in-core cache, and > also relatively rarely happens compared to "git add" so that the cost can > be amortised over time, such as "reset --(hard|mixed)". It's tradeoff. As you said prime_cache_tree() is expensive. cache_tree_update is supposed to be cheap. But cache_tree_update() when empty is even more expensive than prime_cache_tree(). So it boils down how much cache-tree we have after unpack_trees() and whether it's worth repopulate cache-tree again. > Switching branches is likely to fall into that category, but that is just > my gut feeling. I would feel better at night if somebody did a benchmark > ;-) I timed prime_cache_tree() and cache_tree_update() while switching branch on linux-2.6, between v2.6.32 and a quite recent version. prime_cache_tree() took ~55ms while cache_tree_update() 150ms or 90ms (depending on final tree). It depends on your view, whether 55ms is expensive on such a reasonably large repository. I took several seconds for me to complete the checkout though. Checking out with "-q" prime_cache_tree() took 145ms and 80ms respectively, as expensive as cache_tree_update() If cache-tree is only used at commit time, I think we could delay prime_cache_tree() until absolutely needed. We could add an optional index extension recording the fact that index matches certain tree. On the first cache_tree_invalidate_path(), if cache-tree is still empty, we prime cache-tree, then invalidate the requested path. It would then add no cost to a quick branch switch. But if diff-cached also takes advantage of cache-tree, it's a different story. Anyway, I think this patch does better than my last one -- 8< -- diff --git a/builtin/checkout.c b/builtin/checkout.c index 6b9061f..e7eaeec 100644 --- a/builtin/checkout.c +++ b/builtin/checkout.c @@ -387,6 +387,7 @@ static int merge_working_tree(struct checkout_opts *opts, int ret; struct lock_file *lock_file = xcalloc(1, sizeof(struct lock_file)); int newfd = hold_locked_index(lock_file, 1); + int head_index_mismatch = 1; if (read_cache_preload(NULL) < 0) return error(_("corrupt index file")); @@ -396,6 +397,7 @@ static int merge_working_tree(struct checkout_opts *opts, ret = reset_tree(new->commit->tree, opts, 1); if (ret) return ret; + head_index_mismatch = 0; } else { struct tree_desc trees[2]; struct tree *tree; @@ -490,7 +492,27 @@ static int merge_working_tree(struct checkout_opts *opts, ret = reset_tree(new->commit->tree, opts, 0); if (ret) return ret; - } + } else + head_index_mismatch = topts.head_index_mismatch; + } + + /* + * Currently cache-tree is always destroyed after + * unpack_trees(). It's probably a good idea to repopulate + * cache-tree. If the user makes a few modifications and + * commits, tree generation woulda be cheap. If they switch + * away again, not so cheap. + * + * When unpack_trees() learns to retains as much cache-tree as + * possible, this code probably does not help much on tree + * generation, unless the tree difference between to heads are + * too far, little cache-tree can be kept. + */ + if (!head_index_mismatch && + !cache_tree_fully_valid(active_cache_tree)) { + if (!new->commit->tree->object.parsed) + parse_tree(new->commit->tree); + prime_cache_tree(&active_cache_tree, new->commit->tree); } if (write_cache(newfd, active_cache, active_nr) || diff --git a/unpack-trees.c b/unpack-trees.c index 7c9ecf6..f2c518f 100644 --- a/unpack-trees.c +++ b/unpack-trees.c @@ -1022,6 +1022,8 @@ int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options o->result.timestamp.nsec = o->src_index->timestamp.nsec; o->merge_size = len; mark_all_ce_unused(o->src_index); + if (o->fn != twoway_merge) + o->head_index_mismatch = 1; /* * Sparse checkout loop #1: set NEW_SKIP_WORKTREE on existing entries @@ -1736,6 +1738,8 @@ int twoway_merge(struct cache_entry **src, struct unpack_trees_options *o) (oldtree && newtree && !same(oldtree, newtree) && /* 18 and 19 */ same(current, newtree))) { + if (!newtree || (newtree && !same(current, newtree))) + o->head_index_mismatch = 1; return keep_entry(current, o); } else if (oldtree && !newtree && same(current, oldtree)) { diff --git a/unpack-trees.h b/unpack-trees.h index 5e432f5..b75b64e 100644 --- a/unpack-trees.h +++ b/unpack-trees.h @@ -48,7 +48,8 @@ struct unpack_trees_options { gently, exiting_early, show_all_errors, - dry_run; + dry_run, + head_index_mismatch; const char *prefix; int cache_bottom; struct dir_struct *dir; -- 8< -- ^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-02-21 14:45 ` Nguyen Thai Ngoc Duy 2012-02-21 19:16 ` Junio C Hamano @ 2012-02-22 3:32 ` Junio C Hamano 2012-04-10 15:16 ` Piotr Krukowiecki 1 sibling, 1 reply; 52+ messages in thread From: Junio C Hamano @ 2012-02-22 3:32 UTC (permalink / raw) To: Nguyen Thai Ngoc Duy Cc: Thomas Rast, Jeff King, Piotr Krukowiecki, Git Mailing List Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes: >> diff --git a/builtin/checkout.c b/builtin/checkout.c >> index 5bf96ba..c06287a 100644 >> --- a/builtin/checkout.c >> +++ b/builtin/checkout.c >> @@ -319,6 +319,10 @@ static void show_local_changes(struct object *head, struct diff_options *opts) >> die(_("diff_setup_done failed")); >> add_pending_object(&rev, head, NULL); >> run_diff_index(&rev, 0); >> + if (!DIFF_OPT_TST(&rev.diffopt, HAS_CHANGES)) { >> + struct tree *tree = parse_tree_indirect(head->sha1); >> + prime_cache_tree(&active_cache_tree, tree); >> + } >> } I think this patch is wrong on at least two counts. * The run_diff_index(&rev, 0) you reused is doing "diff HEAD" and not "diff --cached HEAD". The added check does not say anything about the comparison between the index and the tree at the HEAD. * Even if we added an extra run_diff_index(&rev, 1) there, or added a call to index_differs_from() to run "diff --cached HEAD" to check what needs to be checked, it is still not quite right. On the latter point, imagine what happens in the two invocations of checkout in the following sequence: $ git reset --hard master $ git checkout master $ git checkout master The second one should notice that the cache tree is fully valid, so the internal "diff --cached" it runs should only open the top-level tree and scan entries in it, without recursing into any of the subtrees, and realize that the index is in sync with "HEAD", which should be a very cheap operation (that is the whole point of the current topic of our discussion looking at the cache-tree). Then the new code calls prime_cache_tree() to read _everything_? Probably cache_tree_fully_valid() should be called before deciding that we need to re-populate the cache tree from "HEAD". ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-02-22 3:32 ` Junio C Hamano @ 2012-04-10 15:16 ` Piotr Krukowiecki 2012-04-10 16:23 ` Junio C Hamano 0 siblings, 1 reply; 52+ messages in thread From: Piotr Krukowiecki @ 2012-04-10 15:16 UTC (permalink / raw) To: Junio C Hamano Cc: Nguyen Thai Ngoc Duy, Thomas Rast, Jeff King, Git Mailing List On Wed, Feb 22, 2012 at 4:32 AM, Junio C Hamano <gitster@pobox.com> wrote: > Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes: > >>> diff --git a/builtin/checkout.c b/builtin/checkout.c >>> index 5bf96ba..c06287a 100644 >>> --- a/builtin/checkout.c >>> +++ b/builtin/checkout.c >>> @@ -319,6 +319,10 @@ static void show_local_changes(struct object *head, struct diff_options *opts) >>> die(_("diff_setup_done failed")); >>> add_pending_object(&rev, head, NULL); >>> run_diff_index(&rev, 0); >>> + if (!DIFF_OPT_TST(&rev.diffopt, HAS_CHANGES)) { >>> + struct tree *tree = parse_tree_indirect(head->sha1); >>> + prime_cache_tree(&active_cache_tree, tree); >>> + } >>> } > > I think this patch is wrong on at least two counts. > > * The run_diff_index(&rev, 0) you reused is doing "diff HEAD" and not > "diff --cached HEAD". The added check does not say anything about the > comparison between the index and the tree at the HEAD. > > * Even if we added an extra run_diff_index(&rev, 1) there, or added a > call to index_differs_from() to run "diff --cached HEAD" to check what > needs to be checked, it is still not quite right. > > On the latter point, imagine what happens in the two invocations of > checkout in the following sequence: > > $ git reset --hard master > $ git checkout master > $ git checkout master > > The second one should notice that the cache tree is fully valid, so the > internal "diff --cached" it runs should only open the top-level tree > and scan entries in it, without recursing into any of the subtrees, and > realize that the index is in sync with "HEAD", which should be a very > cheap operation (that is the whole point of the current topic of our > discussion looking at the cache-tree). Then the new code calls > prime_cache_tree() to read _everything_? > > Probably cache_tree_fully_valid() should be called before deciding that we > need to re-populate the cache tree from "HEAD". Hi, could I ask what is the status of this? There were some patches posted, but I think nothing final? Thanks, -- Piotr Krukowiecki ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-04-10 15:16 ` Piotr Krukowiecki @ 2012-04-10 16:23 ` Junio C Hamano 2012-04-10 18:00 ` Jeff King 0 siblings, 1 reply; 52+ messages in thread From: Junio C Hamano @ 2012-04-10 16:23 UTC (permalink / raw) To: Piotr Krukowiecki Cc: Nguyen Thai Ngoc Duy, Thomas Rast, Jeff King, Git Mailing List Piotr Krukowiecki <piotr.krukowiecki@gmail.com> writes: > could I ask what is the status of this? There were some patches > posted, but I think nothing final? I do not think you meant to address your inquiry to me, but I think these patches tried out some ideas, got issues discovered in them and then got abandoned before resulting in a working code that is ready for testing. I wish there were fewer such series, but it happens (see "Stalled" section in What's cooking). ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-04-10 16:23 ` Junio C Hamano @ 2012-04-10 18:00 ` Jeff King 0 siblings, 0 replies; 52+ messages in thread From: Jeff King @ 2012-04-10 18:00 UTC (permalink / raw) To: Junio C Hamano Cc: Piotr Krukowiecki, Nguyen Thai Ngoc Duy, Thomas Rast, Git Mailing List On Tue, Apr 10, 2012 at 09:23:59AM -0700, Junio C Hamano wrote: > > could I ask what is the status of this? There were some patches > > posted, but I think nothing final? > > I do not think you meant to address your inquiry to me, but I think these > patches tried out some ideas, got issues discovered in them and then got > abandoned before resulting in a working code that is ready for testing. Yes. I think we decided that we needed some pretty good testing to add the cache_tree handling back into unpack_trees. I'd still like to do that testing, but haven't done it yet. -Peff ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-02-20 14:09 ` Thomas Rast 2012-02-20 14:36 ` Nguyen Thai Ngoc Duy @ 2012-02-20 19:57 ` Junio C Hamano 2012-02-20 19:59 ` Thomas Rast 1 sibling, 1 reply; 52+ messages in thread From: Junio C Hamano @ 2012-02-20 19:57 UTC (permalink / raw) To: Thomas Rast Cc: Jeff King, Piotr Krukowiecki, Junio C Hamano, Git Mailing List, Nguyen Thai Ngoc Duy Thomas Rast <trast@inf.ethz.ch> writes: > test_expect_failure 'checkout gives cache-tree' ' > git checkout HEAD^ && > test_shallow_cache_tree > ' Depending on what state you start the checkout from, that is not a valid test. Some form of "git reset" before the checkout to ensure the initial state is needed. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-02-20 19:57 ` Junio C Hamano @ 2012-02-20 19:59 ` Thomas Rast 0 siblings, 0 replies; 52+ messages in thread From: Thomas Rast @ 2012-02-20 19:59 UTC (permalink / raw) To: Junio C Hamano Cc: Jeff King, Piotr Krukowiecki, Git Mailing List, Nguyen Thai Ngoc Duy Junio C Hamano <gitster@pobox.com> writes: > Thomas Rast <trast@inf.ethz.ch> writes: > >> test_expect_failure 'checkout gives cache-tree' ' >> git checkout HEAD^ && >> test_shallow_cache_tree >> ' > > Depending on what state you start the checkout from, that is not a valid > test. Some form of "git reset" before the checkout to ensure the initial > state is needed. Oh, I was just quoting what we already had at the end of t0090 since 4eb0346f. The test preceding it runs 'git reset --hard'. -- Thomas Rast trast@{inf,student}.ethz.ch ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-02-20 14:06 ` Jeff King 2012-02-20 14:09 ` Thomas Rast @ 2012-02-20 14:16 ` Nguyen Thai Ngoc Duy 2012-02-20 14:22 ` Jeff King 2012-02-20 19:56 ` Junio C Hamano 2 siblings, 1 reply; 52+ messages in thread From: Nguyen Thai Ngoc Duy @ 2012-02-20 14:16 UTC (permalink / raw) To: Jeff King Cc: Piotr Krukowiecki, Junio C Hamano, Thomas Rast, Git Mailing List On Mon, Feb 20, 2012 at 9:06 PM, Jeff King <peff@peff.net> wrote: > Interestingly, on my git.git repo, I had an empty cache. Running "git > read-tree HEAD" filled it (according to test-dump-cache-tree). It seems > that running "git checkout" empties the cache. So perhaps git could do > better about keeping the cache valid over time. For fast forward case when result index matches 100% destination tree, yeah we should repopulate cache-tree. "git reset" does that. Not sure about other cases though. I don't think we can keep track what subtrees are unchanged after unpack_trees() in order to keep them. -- Duy ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-02-20 14:16 ` Nguyen Thai Ngoc Duy @ 2012-02-20 14:22 ` Jeff King 0 siblings, 0 replies; 52+ messages in thread From: Jeff King @ 2012-02-20 14:22 UTC (permalink / raw) To: Nguyen Thai Ngoc Duy Cc: Piotr Krukowiecki, Junio C Hamano, Thomas Rast, Git Mailing List On Mon, Feb 20, 2012 at 09:16:43PM +0700, Nguyen Thai Ngoc Duy wrote: > On Mon, Feb 20, 2012 at 9:06 PM, Jeff King <peff@peff.net> wrote: > > Interestingly, on my git.git repo, I had an empty cache. Running "git > > read-tree HEAD" filled it (according to test-dump-cache-tree). It seems > > that running "git checkout" empties the cache. So perhaps git could do > > better about keeping the cache valid over time. > > For fast forward case when result index matches 100% destination tree, > yeah we should repopulate cache-tree. "git reset" does that. Not sure > about other cases though. I don't think we can keep track what > subtrees are unchanged after unpack_trees() in order to keep them. Yeah, doing it after unpack_trees seems crazy. But I really feel like unpack_trees should be able to handle cache updates as it unpacks. It knows what is being updated and what is being merged. But maybe it is more complicated than that; I haven't looked at the code yet. -Peff ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-02-20 14:06 ` Jeff King 2012-02-20 14:09 ` Thomas Rast 2012-02-20 14:16 ` Nguyen Thai Ngoc Duy @ 2012-02-20 19:56 ` Junio C Hamano 2012-02-20 20:09 ` Jeff King 2 siblings, 1 reply; 52+ messages in thread From: Junio C Hamano @ 2012-02-20 19:56 UTC (permalink / raw) To: Jeff King Cc: Piotr Krukowiecki, Thomas Rast, Git Mailing List, Nguyen Thai Ngoc Duy Jeff King <peff@peff.net> writes: > Interestingly, on my git.git repo, I had an empty cache. Running "git > read-tree HEAD" filled it (according to test-dump-cache-tree). It seems > that running "git checkout" empties the cache. So perhaps git could do > better about keeping the cache valid over time. At least in the early days unpack-trees built the result by manually adding an entry without calling the add_index_entry() all over the place, which meant that it was futile to pretend that there is even a slight chance that complex beast would correctly invalidate cached tree information at all the necessary places. I recall that I added a code to nuke the cache tree at the very beginning of "merging" codepaths to avoid any bogus cache tree result to be stored in the resulting index. These days, we have src_index and dst_index, and dst_index IIRC can start as empty in which case "start from kept information and selectively invalidate" would not work at all. When src_index and dst_index are the same, however, you should be able to keep the cached tree valid, at least in theory. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: git status: small difference between stating whole repository and small subdirectory 2012-02-20 19:56 ` Junio C Hamano @ 2012-02-20 20:09 ` Jeff King 0 siblings, 0 replies; 52+ messages in thread From: Jeff King @ 2012-02-20 20:09 UTC (permalink / raw) To: Junio C Hamano Cc: Piotr Krukowiecki, Thomas Rast, Git Mailing List, Nguyen Thai Ngoc Duy On Mon, Feb 20, 2012 at 11:56:13AM -0800, Junio C Hamano wrote: > These days, we have src_index and dst_index, and dst_index IIRC can start > as empty in which case "start from kept information and selectively > invalidate" would not work at all. When src_index and dst_index are the > same, however, you should be able to keep the cached tree valid, at least > in theory. Yeah, I was worried that the cache invalidations sprinkled throughout unpack-trees.c would not be sufficient (and because we are invalidating, a missing invalidation would give us bogus cache info, which is Very Bad). So I think the one-liner I posted before is not sufficient in the general case, because it definitely doesn't consider where the destination is starting from. It should at least be more like: if (src_index == dst_index) { /* We would ordinarily want to do a deep copy here, but since * we know that we will be overwriting src_index in the long * run, it's OK to just take ownership of its cache_tree. */ o->result.cache_tree = o->src_index->cache_tree; o->src_index->cache_tree = NULL; } [... do the usual tree traversal here, except invalidate entries in o->result.call_tree instead of o->src_index. That makes it a no-op when src_index != dst_index (because we have no cache tree defined in result, then), and otherwise we are invalidating what will go into the result...] [then as before, we copy the result to dst_index; except now the result may have src_index's cache_tree plus any invalidations] o->result = *o->dst_index; And fortunately that does exactly what we want in all cases, because we always either read from and write to the_index, or we write to NULL (in which case we will not bother with a cache_tree for the result, and it is fixing a minor bug that we might be invalidating src_index's tree in the first place). I'm still slightly worried that we are missing some invalidation somewhere deep in unpack_tree's callbacks (especially because they _are_ callbacks, and invalidating the cache_tree properly is now a promise that the callbacks have to make). -Peff ^ permalink raw reply [flat|nested] 52+ messages in thread
end of thread, other threads:[~2012-04-10 18:00 UTC | newest] Thread overview: 52+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2012-02-10 9:42 git status: small difference between stating whole repository and small subdirectory Piotr Krukowiecki 2012-02-10 12:33 ` Nguyen Thai Ngoc Duy 2012-02-10 13:46 ` Piotr Krukowiecki 2012-02-10 14:37 ` Nguyen Thai Ngoc Duy 2012-02-13 16:54 ` Piotr Krukowiecki 2012-02-10 16:18 ` Piotr Krukowiecki 2012-02-14 11:34 ` Thomas Rast 2012-02-15 8:57 ` Piotr Krukowiecki 2012-02-15 11:01 ` Nguyen Thai Ngoc Duy 2012-02-15 15:14 ` Piotr Krukowiecki 2012-02-16 13:22 ` Piotr Krukowiecki 2012-02-15 19:03 ` Jeff King 2012-02-16 13:37 ` Piotr Krukowiecki 2012-02-16 14:05 ` Thomas Rast 2012-02-16 20:15 ` Junio C Hamano 2012-02-17 16:55 ` Piotr Krukowiecki 2012-02-16 19:20 ` Jeff King 2012-02-17 17:19 ` Piotr Krukowiecki 2012-02-17 20:37 ` Jeff King 2012-02-17 22:25 ` Junio C Hamano 2012-02-17 22:29 ` Jeff King 2012-02-20 8:25 ` Piotr Krukowiecki 2012-02-20 14:06 ` Jeff King 2012-02-20 14:09 ` Thomas Rast 2012-02-20 14:36 ` Nguyen Thai Ngoc Duy 2012-02-20 14:39 ` Jeff King 2012-02-20 15:11 ` Jeff King 2012-02-20 18:45 ` Thomas Rast 2012-02-20 20:35 ` Jeff King 2012-02-20 22:04 ` Junio C Hamano 2012-02-20 22:41 ` Jeff King 2012-02-20 23:31 ` Junio C Hamano 2012-02-21 7:21 ` Piotr Krukowiecki 2012-02-20 20:08 ` Junio C Hamano 2012-02-20 20:17 ` Jeff King 2012-02-21 14:45 ` Nguyen Thai Ngoc Duy 2012-02-21 19:16 ` Junio C Hamano 2012-02-22 2:12 ` Nguyen Thai Ngoc Duy 2012-02-22 2:55 ` Junio C Hamano 2012-02-22 12:54 ` Nguyen Thai Ngoc Duy 2012-02-22 13:17 ` Thomas Rast 2012-02-22 10:34 ` Nguyen Thai Ngoc Duy 2012-02-22 3:32 ` Junio C Hamano 2012-04-10 15:16 ` Piotr Krukowiecki 2012-04-10 16:23 ` Junio C Hamano 2012-04-10 18:00 ` Jeff King 2012-02-20 19:57 ` Junio C Hamano 2012-02-20 19:59 ` Thomas Rast 2012-02-20 14:16 ` Nguyen Thai Ngoc Duy 2012-02-20 14:22 ` Jeff King 2012-02-20 19:56 ` Junio C Hamano 2012-02-20 20:09 ` Jeff King
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).