* [PATCH 4/4] Invalidate cache-tree entries for touched paths in git-apply. @ 2006-04-23 23:52 Junio C Hamano 2006-04-24 1:25 ` Junio C Hamano 0 siblings, 1 reply; 8+ messages in thread From: Junio C Hamano @ 2006-04-23 23:52 UTC (permalink / raw) To: Linus Torvalds; +Cc: git This updates git-apply to maintain cache-tree information. With this and the previous write-tree patch, repeated "apply --index" followed by "write-tree" on a huge tree will hopefully become faster. Signed-off-by: Junio C Hamano <junkio@cox.net> --- * ... then the big rock falls. With this, I tried to apply and then write-tree "diff-tree -p $commit^1 $commit" on top of "$commit^1" for the last 20 or so commits in the kernel tree. The "master" version takes 0.15 second per patch on my Duron 750 with 700MB, while this one does that in 0.06 second. This also helps the memory pressure because we do not have to regenerate unchanged trees. 810 minor faults with the patch vs 2150 minor faults without. apply.c | 16 ++++++++++++++-- 1 files changed, 14 insertions(+), 2 deletions(-) 94005d7bb4b55e9984f6505a8916d23d304ccc4d diff --git a/apply.c b/apply.c index 269210a..f7cdefa 100644 --- a/apply.c +++ b/apply.c @@ -8,9 +8,14 @@ */ #include <fnmatch.h> #include "cache.h" +#include "cache-tree.h" #include "quote.h" #include "blob.h" +static unsigned char active_cache_sha1[20]; +static struct cache_tree *active_cache_tree; + + // --check turns on checking that the working tree matches the // files that are being modified, but doesn't apply the patch // --stat does just a diffstat, and doesn't actually apply @@ -1717,6 +1722,7 @@ static void remove_file(struct patch *pa if (write_index) { if (remove_file_from_cache(patch->old_name) < 0) die("unable to remove %s from index", patch->old_name); + cache_tree_invalidate_path(active_cache_tree, patch->old_name); } unlink(patch->old_name); } @@ -1815,6 +1821,7 @@ static void create_file(struct patch *pa mode = S_IFREG | 0644; create_one_file(path, mode, buf, size); add_index_file(path, mode, buf, size); + cache_tree_invalidate_path(active_cache_tree, path); } static void write_out_one_result(struct patch *patch) @@ -1912,8 +1919,9 @@ static int apply_patch(int fd, const cha if (write_index) newfd = hold_index_file_for_update(&cache_file, get_index_file()); if (check_index) { - if (read_cache() < 0) + if (read_cache_1(active_cache_sha1) < 0) die("unable to read index file"); + active_cache_tree = read_cache_tree(active_cache_sha1); } if ((check || apply) && check_patch_list(list) < 0) @@ -1923,9 +1931,13 @@ static int apply_patch(int fd, const cha write_out_results(list, skipped_patch); if (write_index) { - if (write_cache(newfd, active_cache, active_nr) || + if (write_cache_1(newfd, active_cache, active_nr, + active_cache_sha1) || commit_index_file(&cache_file)) die("Unable to write new cachefile"); + cache_tree_update(active_cache_tree, + active_cache, active_nr, 1); + write_cache_tree(active_cache_sha1, active_cache_tree); } if (show_index_info) -- 1.3.0.g623a ^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [PATCH 4/4] Invalidate cache-tree entries for touched paths in git-apply. 2006-04-23 23:52 [PATCH 4/4] Invalidate cache-tree entries for touched paths in git-apply Junio C Hamano @ 2006-04-24 1:25 ` Junio C Hamano 2006-04-24 2:47 ` Junio C Hamano 2006-04-24 3:03 ` [PATCH 5/4] Minimum fixups to cache-tree Junio C Hamano 0 siblings, 2 replies; 8+ messages in thread From: Junio C Hamano @ 2006-04-24 1:25 UTC (permalink / raw) To: Linus Torvalds; +Cc: git Junio C Hamano <junkio@cox.net> writes: > * ... then the big rock falls. With this, I tried to apply and > then write-tree "diff-tree -p $commit^1 $commit" on top of > "$commit^1" for the last 20 or so commits in the kernel tree. > The "master" version takes 0.15 second per patch on my Duron > 750 with 700MB, while this one does that in 0.06 second. > This also helps the memory pressure because we do not have to > regenerate unchanged trees. 810 minor faults with the patch > vs 2150 minor faults without. Sorry, but not really. The patch is wrong and the measurement was flawed. It was doing unnecessarily more work in git-apply, which made git-write-tree a no-op, and I was measuring only git-write-tree. The following goes on top of the series to remove that unnecessary work from git-apply. Unfortunately this makes the overall combination a bit slower than before X-<. But I have not optimized cache-tree.c for speed; for example, its subtree lists are not even sorted. Once that is done we may get decent speedups from the combo. --- diff --git a/apply.c b/apply.c index 5fa2c1e..e283df3 100644 --- a/apply.c +++ b/apply.c @@ -1935,8 +1935,6 @@ static int apply_patch(int fd, const cha active_cache_sha1) || commit_index_file(&cache_file)) die("Unable to write new cachefile"); - cache_tree_update(active_cache_tree, - active_cache, active_nr, 1); write_cache_tree(active_cache_sha1, active_cache_tree); } ^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [PATCH 4/4] Invalidate cache-tree entries for touched paths in git-apply. 2006-04-24 1:25 ` Junio C Hamano @ 2006-04-24 2:47 ` Junio C Hamano 2006-04-24 3:03 ` [PATCH 5/4] Minimum fixups to cache-tree Junio C Hamano 1 sibling, 0 replies; 8+ messages in thread From: Junio C Hamano @ 2006-04-24 2:47 UTC (permalink / raw) To: Linus Torvalds; +Cc: git Junio C Hamano <junkio@cox.net> writes: > Junio C Hamano <junkio@cox.net> writes: > >> * ... then the big rock falls. With this, I tried to apply and >> then write-tree "diff-tree -p $commit^1 $commit" on top of >> "$commit^1" for the last 20 or so commits in the kernel tree. >> The "master" version takes 0.15 second per patch on my Duron >> 750 with 700MB, while this one does that in 0.06 second. >> This also helps the memory pressure because we do not have to >> regenerate unchanged trees. 810 minor faults with the patch >> vs 2150 minor faults without. > > Sorry, but not really. The patch is wrong and the measurement > was flawed. Again, sorry, but there are some more bugs in the cache-tree code that I need to fix and re-measure. In the meantime, please do not use it on your production repositories. It does not seem to produce corrupt trees, but it creates broken index.aux for no good reason. ^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH 5/4] Minimum fixups to cache-tree 2006-04-24 1:25 ` Junio C Hamano 2006-04-24 2:47 ` Junio C Hamano @ 2006-04-24 3:03 ` Junio C Hamano 2006-04-24 5:41 ` 50% speed-up of "git-apply && git-write-tree" sequence Junio C Hamano 1 sibling, 1 reply; 8+ messages in thread From: Junio C Hamano @ 2006-04-24 3:03 UTC (permalink / raw) To: Linus Torvalds; +Cc: git The first hunk is the repeat of the previous "oops". The second is a real fix. With this applied, 100-patch series (going from present to past at the Linux 2.6 kernel tip) applies correctly with the following stats: Trying w/o the patch... 24.57user 4.48system 0:34.47elapsed 84%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+322156minor)pagefaults 0swaps Trying w/ the patch... 15.81user 3.00system 0:20.84elapsed 90%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+252063minor)pagefaults 0swaps So there definitely is an improvement. *Grin*. The script used to bench this is attached at the end. "linus" is the tracking branch and currently it points at v2.6.17-rc2-g6b426e7 $HOME/git-master/bin is where I installed the "master" version, and $J (aka ../git.junio/) is the freshly compiled area with the patch series applied (all five of four ;-). -- >8 -- Minimum fixups to make things usable. --- diff --git a/apply.c b/apply.c index 5fa2c1e..e283df3 100644 --- a/apply.c +++ b/apply.c @@ -1935,8 +1935,6 @@ static int apply_patch(int fd, const cha active_cache_sha1) || commit_index_file(&cache_file)) die("Unable to write new cachefile"); - cache_tree_update(active_cache_tree, - active_cache, active_nr, 1); write_cache_tree(active_cache_sha1, active_cache_tree); } diff --git a/cache-tree.c b/cache-tree.c index 4dbdb65..f6d1dd1 100644 --- a/cache-tree.c +++ b/cache-tree.c @@ -348,10 +348,7 @@ #endif } for (i = 0; i < it->subtree_nr; i++) { struct cache_tree_sub *down = it->down[i]; - int len = pathlen + down->namelen; - memcpy(path + pathlen, down->name, down->namelen); - path[len] = '/'; - buffer = write_one(down->cache_tree, path, len+1, + buffer = write_one(down->cache_tree, down->name, down->namelen, buffer, size, offset); } return buffer; -- #!/bin/sh J=../git.junio M=$HOME/git-master/bin export J M git reset --hard linus previous= git rev-list -n 100 HEAD | while read commit do if test -n "$previous" then git-diff-tree -p "$previous" "$commit" >"$commit-$previous" echo "$commit $previous" fi previous="$commit" done >series git reset --hard linus echo "Trying w/o the patch..." /usr/bin/time sh -c ' while read commit previous do $M/git-apply --whitespace=nowarn --index "$commit-$previous" $M/git-write-tree done <series ' >out-1 tree1=`git write-tree` git reset --hard linus echo "Trying w/ the patch..." $J/git-write-tree /usr/bin/time sh -c ' while read commit previous do $J/git-apply --whitespace=nowarn --index "$commit-$previous" $J/git-write-tree done <series ' >out-2 tree2=`git write-tree` test "$tree1" = "$tree2" || exit 1 cmp out-1 out-2 ^ permalink raw reply related [flat|nested] 8+ messages in thread
* 50% speed-up of "git-apply && git-write-tree" sequence. 2006-04-24 3:03 ` [PATCH 5/4] Minimum fixups to cache-tree Junio C Hamano @ 2006-04-24 5:41 ` Junio C Hamano 2006-04-24 21:31 ` maintenance of cache-tree data Junio C Hamano 0 siblings, 1 reply; 8+ messages in thread From: Junio C Hamano @ 2006-04-24 5:41 UTC (permalink / raw) To: Linus Torvalds; +Cc: git I've fixed-up the series and they are now sitting near the tip of "pu" branch tonight. With some more testing I am hoping to merge it in "next" soon. I should recap the overall idea for general public here. When applying a series of patches to your working tree and making commits out of the results, what happens is roughly this: 1. Your index file exactly matches your HEAD commit; you may have local modifications to some paths in your working tree as long as they do not interfere with the patch application. 2. "git-apply --index" reads the index file, reads the patch, applies them to both index and the working tree, and writes the updated index file out. 3. "git-write-tree" reads the index file, writes the index entries out as a set of hierarchical tree objects, the top of which is wrapped in a new commit object that has the current HEAD as the parent, and the commit becomes the new HEAD. You repeat 1-3 as many times as you apply patches. When dealing with a huge tree, with a typical patch touching only a handful paths, this is not very efficient. A recent kernel source tree has about 1200 directories, each of which needs to be made into the "tree" object format in-core from the data in the index file, and we hash it to see if we already have that tree, and if not we write it out after compression (if we do have it, we do not do anything after checking). The updated code does this: (1) When git-write-tree writes out trees from the index, we store <directory, tree SHA1> pair for all the trees we compute, in .git/index.aux file. (2) When git-write-tree starts up, it reads the index file, and also reads the .git/index.aux file. The latter file contains a fingerprint of the former, so if the index was updated without updating the index.aux, what is read is discarded. However, if .git/index.aux is up-to-date, what it records can be reused when we scan the index. (3) There are a few commands that updates the index file. git-update-index is the most well-known one, but read-tree and apply also update it. No command other than apply updates index.aux when writing out index (yet). However, git-apply has been told about it. It reads index and index.aux file, and when it updates an index entry, it _invalidates_ the directory information it read from index.aux file, and writes the updated index.aux file out when it is done. Note. Earlier I sent out a fix ([PATCH 5/4]) that contained a two-line removal from apply.c; the code was doing more than just invalidating entries, but actually recomputing the tree. The fixed code does not hash tree information when git-apply writes index out. (4) When git-write-tree starts up after git-apply finishes its job, it notices index.aux file matches the fingerprint of the index (so it is up-to-date) and some entries have been _invalidated_. When it writes out the index, the parts of the directory hierarchy that have not been invalidated do not have to be formatted in-core into "tree" -- we already know what tree object they are. Only the parts that were invalidated need to be computed. For this, a new data structure "cache_tree" is introduced. The way to use the API is very simple. You initialize it like this: struct cache_tree *active_cache_tree; unsigned char sha1[20]; read_cache_1(sha1); active_cache_tree = read_cache_tree(sha1); The new function read_cache_1() is similar to the traditional read_cache(), but it gives back the index file fingerprint. This is given to read_cache_tree() to make sure index.aux is not stale. Then, when you touch a path in the cache, you invalidate the paths in the cache_tree with: cache_tree_invalidate_path(active_cache_tree, path); For example, if you are updating an index entry "fs/ext3/inode.c", it invalidates the cached tree SHA1 information for top-level tree, "fs" tree and "fs/ext3" tree. All other information you read from index.aux are still valid (IOW, three out of 1200 are invalidated). When you are done, you write out the index file and the updated index.aux file like this: if (write_cache_1(newfd, active_cache, active_nr, active_cache_sha1) || commit_index_file(&cache_file)) die("Unable to write new cachefile"); write_cache_tree(active_cache_sha1, active_cache_tree); Again, write_cache_1() has one extra parameter to receive the updated index file fingerprint, and you call write_cache_tree() with it. It updates index.aux file (with invalidated entries). One thing write-tree does different is this (note that it does not update the index file, it just reads from it): if (cache_tree_update(active_cache_tree, active_cache, active_nr, missing_ok)) die("git-write-tree: error building trees"); write_cache_tree(active_cache_sha1, active_cache_tree); The call to cache_tree_update() takes the (possibly partially invalidated) cache_tree, fills in the invalidated parts from the index file (this involves creation of new "tree" objects as needed). So the above call to write_cache_tree() writes a complete index.aux file out without any invalidated entries. Obviously, the change similar to what I did to apply.c in this series can be done to other index-updating commands. That's "low hanging fruit" for the week ;-). ^ permalink raw reply [flat|nested] 8+ messages in thread
* maintenance of cache-tree data 2006-04-24 5:41 ` 50% speed-up of "git-apply && git-write-tree" sequence Junio C Hamano @ 2006-04-24 21:31 ` Junio C Hamano 2006-04-24 22:34 ` Junio C Hamano 0 siblings, 1 reply; 8+ messages in thread From: Junio C Hamano @ 2006-04-24 21:31 UTC (permalink / raw) To: git; +Cc: Linus Torvalds Junio C Hamano <junkio@cox.net> writes: > (1) When git-write-tree writes out trees from the index, we > store <directory, tree SHA1> pair for all the trees we > compute, in .git/index.aux file. There are two reasons I did not make this extra information part of the index file. The number one reason was because the current index file format is pretty dense, and I did not find an obvious hole in the front, in the middle or at the tail to sneak extra data in without upsetting existing code and without updating index file version. If this were a change to add a great new feature, that might have warranted bumping the version up, but the cache-tree is an optimization and if you lose that information, all you have lost is that now your write-tree needs to recompute the whole tree as before (IOW, not much). The second reason was I wanted to do this step-by-step, and wanted to do a demonstration of an end-to-end workflow that gains from this set of changes (apply followed by write-tree was an obvious minimum set of commands), and while doing so I did not have to disrupt other commands that are unaware of this extension. Having said that, cache-tree.c has an internal interface to serialize the cache-tree data in-core, primarily because I was unsure if I wanted to append this information somewhere inside the main index file or have it external when I did it; so we could later push it into the index if we wanted to. Ideally, everybody who writes index should be converted to update cache-tree to help eventual write-tree. Right now, if a command updates index without updating a matching cache-tree, the whole thing is invalidated; this way, you do not risk using stale "cached" data. Currently the command that primes the cache-tree is write-tree. This may be counterintuitive -- even I myself would expect that read-tree would prime it, and various index-updaters invalidate subtrees they touch, and write-tree to use the surviving parts to speed up what it needs to do, and write out an updated, fully-vaild cache-tree. I did not do it only because it was not necessary for "apply then write-tree" cycle, but read-tree should be taught about cache-tree to help others, _and_ to help itself. When "read-tree -m O A B" merges three trees, we iterate over all index entries, even when only a small part of the tree has changed. This could be helped in a big way if the current index has valid cache-tree information for the parts unaffected by the merge. If all three trees have identical tree in higher level subdirectory (e.g. "fs/" in the kernel source), and if the index has not touched anything in "fs/" since it read from our tree (i.e. "A"), then we do not even have to descend into that directory in the working tree to see which index entries are dirty. We can just keep index entries that begin with "fs/" intact, keep the cache-tree entry for that directory as it was read from "A". This would be a big win -- we do not have to read tree objects under "fs/" (there are 62 trees under "fs/", so we save uncompressing and undeltifying 180 objects). This operation would need to invalidate entries in cache-tree that are involved in the actual merge. Branch-switching two-tree form "read-tree -m OLD NEW" probably can benefit from the same optimization. Obviously, a single-tree form of read-tree should be able to prime the cache-tree with fully valid data before writing the index out. Having said that, I am not touching read-tree myself for now; I am lazy. ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: maintenance of cache-tree data 2006-04-24 21:31 ` maintenance of cache-tree data Junio C Hamano @ 2006-04-24 22:34 ` Junio C Hamano 2006-04-25 23:05 ` Junio C Hamano 0 siblings, 1 reply; 8+ messages in thread From: Junio C Hamano @ 2006-04-24 22:34 UTC (permalink / raw) To: git; +Cc: Linus Torvalds Junio C Hamano <junkio@cox.net> writes: > The number one reason was because the current index file format > is pretty dense, and I did not find an obvious hole in the > front, in the middle or at the tail to sneak extra data in > without upsetting existing code and without updating index file > version. Well, I was blind ;-). As long as the whole-file SHA1 matches, read_cache() does not care if we have extra data after the series of active_nr cache entry data in the index file. I'm working on a patch now. ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: maintenance of cache-tree data 2006-04-24 22:34 ` Junio C Hamano @ 2006-04-25 23:05 ` Junio C Hamano 0 siblings, 0 replies; 8+ messages in thread From: Junio C Hamano @ 2006-04-25 23:05 UTC (permalink / raw) To: git; +Cc: Linus Torvalds Junio C Hamano <junkio@cox.net> writes: > Well, I was blind ;-). As long as the whole-file SHA1 matches, > read_cache() does not care if we have extra data after the > series of active_nr cache entry data in the index file. > > I'm working on a patch now. So I did. There is one bad thing; so far "write-tree" was a read-only consumer of the index file, but now it primes the cache-tree structure and needs to update the index. But that is minor. While I was at it, I made this "stuffing extra cruft in the index" slightly more generic than I needed it for this particular application. What I see this _might_ be useful for are: - We would want to store which commit of a subproject a particular subdirectory came from. This was one missing piece from the "bind commit" proposal that wasn't implemented in the jc/bind branch. - We might want to record "at this path there is a directory, albeit empty"; this cannot be expressed with an usual index entry. We might be able to use cache-tree for that, but I think this is something different at the logical level. While cache-tree is to be fully populated (by write-tree and perhaps read-tree later) and invalidated partially when update-index and friends smudge part of the tree, this is not something we would want to even invalidate (IOW, it should always be up-to-date), so they serve different purposes. I still haven't looked at the read-tree yet, but as I outlined in a previous message, its intra-index merge could take advantage of cache-tree. "diff-index", especially "--cached" kind, also could use it to skip unchanged subtrees altogether. ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2006-04-25 23:06 UTC | newest] Thread overview: 8+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2006-04-23 23:52 [PATCH 4/4] Invalidate cache-tree entries for touched paths in git-apply Junio C Hamano 2006-04-24 1:25 ` Junio C Hamano 2006-04-24 2:47 ` Junio C Hamano 2006-04-24 3:03 ` [PATCH 5/4] Minimum fixups to cache-tree Junio C Hamano 2006-04-24 5:41 ` 50% speed-up of "git-apply && git-write-tree" sequence Junio C Hamano 2006-04-24 21:31 ` maintenance of cache-tree data Junio C Hamano 2006-04-24 22:34 ` Junio C Hamano 2006-04-25 23:05 ` Junio C Hamano
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).