[PATCH 4/4] Invalidate cache-tree entries for touched paths in git-apply.

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 4/4] Invalidate cache-tree entries for touched paths in git-apply.
@ 2006-04-23 23:52 Junio C Hamano
  2006-04-24  1:25 ` Junio C Hamano
  0 siblings, 1 reply; 8+ messages in thread
From: Junio C Hamano @ 2006-04-23 23:52 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

This updates git-apply to maintain cache-tree information.  With
this and the previous write-tree patch, repeated "apply --index"
followed by "write-tree" on a huge tree will hopefully become
faster.

Signed-off-by: Junio C Hamano <junkio@cox.net>

---

 * ... then the big rock falls.  With this, I tried to apply and
   then write-tree "diff-tree -p $commit^1 $commit" on top of
   "$commit^1" for the last 20 or so commits in the kernel tree.
   The "master" version takes 0.15 second per patch on my Duron
   750 with 700MB, while this one does that in 0.06 second.
   This also helps the memory pressure because we do not have to
   regenerate unchanged trees.  810 minor faults with the patch
   vs 2150 minor faults without.

 apply.c |   16 ++++++++++++++--
 1 files changed, 14 insertions(+), 2 deletions(-)

94005d7bb4b55e9984f6505a8916d23d304ccc4d
diff --git a/apply.c b/apply.c
index 269210a..f7cdefa 100644
--- a/apply.c
+++ b/apply.c
@@ -8,9 +8,14 @@
  */
 #include <fnmatch.h>
 #include "cache.h"
+#include "cache-tree.h"
 #include "quote.h"
 #include "blob.h"
 
+static unsigned char active_cache_sha1[20];
+static struct cache_tree *active_cache_tree;
+
+
 //  --check turns on checking that the working tree matches the
 //    files that are being modified, but doesn't apply the patch
 //  --stat does just a diffstat, and doesn't actually apply
@@ -1717,6 +1722,7 @@ static void remove_file(struct patch *pa
 	if (write_index) {
 		if (remove_file_from_cache(patch->old_name) < 0)
 			die("unable to remove %s from index", patch->old_name);
+		cache_tree_invalidate_path(active_cache_tree, patch->old_name);
 	}
 	unlink(patch->old_name);
 }
@@ -1815,6 +1821,7 @@ static void create_file(struct patch *pa
 		mode = S_IFREG | 0644;
 	create_one_file(path, mode, buf, size);	
 	add_index_file(path, mode, buf, size);
+	cache_tree_invalidate_path(active_cache_tree, path);
 }
 
 static void write_out_one_result(struct patch *patch)
@@ -1912,8 +1919,9 @@ static int apply_patch(int fd, const cha
 	if (write_index)
 		newfd = hold_index_file_for_update(&cache_file, get_index_file());
 	if (check_index) {
-		if (read_cache() < 0)
+		if (read_cache_1(active_cache_sha1) < 0)
 			die("unable to read index file");
+		active_cache_tree = read_cache_tree(active_cache_sha1);
 	}
 
 	if ((check || apply) && check_patch_list(list) < 0)
@@ -1923,9 +1931,13 @@ static int apply_patch(int fd, const cha
 		write_out_results(list, skipped_patch);
 
 	if (write_index) {
-		if (write_cache(newfd, active_cache, active_nr) ||
+		if (write_cache_1(newfd, active_cache, active_nr,
+				  active_cache_sha1) ||
 		    commit_index_file(&cache_file))
 			die("Unable to write new cachefile");
+		cache_tree_update(active_cache_tree,
+				  active_cache, active_nr, 1);
+		write_cache_tree(active_cache_sha1, active_cache_tree);
 	}
 
 	if (show_index_info)
-- 
1.3.0.g623a

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH 4/4] Invalidate cache-tree entries for touched paths in git-apply.
  2006-04-23 23:52 [PATCH 4/4] Invalidate cache-tree entries for touched paths in git-apply Junio C Hamano
@ 2006-04-24  1:25 ` Junio C Hamano
  2006-04-24  2:47   ` Junio C Hamano
  2006-04-24  3:03   ` [PATCH 5/4] Minimum fixups to cache-tree Junio C Hamano
  0 siblings, 2 replies; 8+ messages in thread
From: Junio C Hamano @ 2006-04-24  1:25 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

Junio C Hamano <junkio@cox.net> writes:

>  * ... then the big rock falls.  With this, I tried to apply and
>    then write-tree "diff-tree -p $commit^1 $commit" on top of
>    "$commit^1" for the last 20 or so commits in the kernel tree.
>    The "master" version takes 0.15 second per patch on my Duron
>    750 with 700MB, while this one does that in 0.06 second.
>    This also helps the memory pressure because we do not have to
>    regenerate unchanged trees.  810 minor faults with the patch
>    vs 2150 minor faults without.

Sorry, but not really.  The patch is wrong and the measurement
was flawed.

It was doing unnecessarily more work in git-apply, which made
git-write-tree a no-op, and I was measuring only git-write-tree.

The following goes on top of the series to remove that
unnecessary work from git-apply.  Unfortunately this makes the
overall combination a bit slower than before X-<.

But I have not optimized cache-tree.c for speed; for example,
its subtree lists are not even sorted.  Once that is done we may
get decent speedups from the combo.

---
diff --git a/apply.c b/apply.c
index 5fa2c1e..e283df3 100644
--- a/apply.c
+++ b/apply.c
@@ -1935,8 +1935,6 @@ static int apply_patch(int fd, const cha
 				  active_cache_sha1) ||
 		    commit_index_file(&cache_file))
 			die("Unable to write new cachefile");
-		cache_tree_update(active_cache_tree,
-				  active_cache, active_nr, 1);
 		write_cache_tree(active_cache_sha1, active_cache_tree);
 	}
 

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH 4/4] Invalidate cache-tree entries for touched paths in git-apply.
  2006-04-24  1:25 ` Junio C Hamano
@ 2006-04-24  2:47   ` Junio C Hamano
  2006-04-24  3:03   ` [PATCH 5/4] Minimum fixups to cache-tree Junio C Hamano
  1 sibling, 0 replies; 8+ messages in thread
From: Junio C Hamano @ 2006-04-24  2:47 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

Junio C Hamano <junkio@cox.net> writes:

> Junio C Hamano <junkio@cox.net> writes:
>
>>  * ... then the big rock falls.  With this, I tried to apply and
>>    then write-tree "diff-tree -p $commit^1 $commit" on top of
>>    "$commit^1" for the last 20 or so commits in the kernel tree.
>>    The "master" version takes 0.15 second per patch on my Duron
>>    750 with 700MB, while this one does that in 0.06 second.
>>    This also helps the memory pressure because we do not have to
>>    regenerate unchanged trees.  810 minor faults with the patch
>>    vs 2150 minor faults without.
>
> Sorry, but not really.  The patch is wrong and the measurement
> was flawed.

Again, sorry, but there are some more bugs in the cache-tree
code that I need to fix and re-measure.

In the meantime, please do not use it on your production
repositories.  It does not seem to produce corrupt trees, but it
creates broken index.aux for no good reason.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 5/4] Minimum fixups to cache-tree
  2006-04-24  1:25 ` Junio C Hamano
  2006-04-24  2:47   ` Junio C Hamano
@ 2006-04-24  3:03   ` Junio C Hamano
  2006-04-24  5:41     ` 50% speed-up of "git-apply && git-write-tree" sequence Junio C Hamano
  1 sibling, 1 reply; 8+ messages in thread
From: Junio C Hamano @ 2006-04-24  3:03 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

The first hunk is the repeat of the previous "oops".  The second
is a real fix.

With this applied, 100-patch series (going from present to past
at the Linux 2.6 kernel tip) applies correctly with the
following stats:

Trying w/o the patch...
24.57user 4.48system 0:34.47elapsed 84%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+322156minor)pagefaults 0swaps
Trying w/ the patch...
15.81user 3.00system 0:20.84elapsed 90%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+252063minor)pagefaults 0swaps

So there definitely is an improvement.  *Grin*.

The script used to bench this is attached at the end.  "linus" is the tracking
branch and currently it points at v2.6.17-rc2-g6b426e7
$HOME/git-master/bin is where I installed the "master" version,
and $J (aka ../git.junio/) is the freshly compiled area with the
patch series applied (all five of four ;-).

-- >8 --
Minimum fixups to make things usable.

---
diff --git a/apply.c b/apply.c
index 5fa2c1e..e283df3 100644
--- a/apply.c
+++ b/apply.c
@@ -1935,8 +1935,6 @@ static int apply_patch(int fd, const cha
 				  active_cache_sha1) ||
 		    commit_index_file(&cache_file))
 			die("Unable to write new cachefile");
-		cache_tree_update(active_cache_tree,
-				  active_cache, active_nr, 1);
 		write_cache_tree(active_cache_sha1, active_cache_tree);
 	}
 
diff --git a/cache-tree.c b/cache-tree.c
index 4dbdb65..f6d1dd1 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -348,10 +348,7 @@ #endif
 	}
 	for (i = 0; i < it->subtree_nr; i++) {
 		struct cache_tree_sub *down = it->down[i];
-		int len = pathlen + down->namelen;
-		memcpy(path + pathlen, down->name, down->namelen);
-		path[len] = '/';
-		buffer = write_one(down->cache_tree, path, len+1,
+		buffer = write_one(down->cache_tree, down->name, down->namelen,
 				   buffer, size, offset);
 	}
 	return buffer;

--

#!/bin/sh
J=../git.junio
M=$HOME/git-master/bin
export J M

git reset --hard linus

previous=
git rev-list -n 100 HEAD |
while read commit
do
	if test -n "$previous"
	then
		git-diff-tree -p "$previous" "$commit" >"$commit-$previous"
		echo "$commit $previous"
	fi
	previous="$commit"
done >series

git reset --hard linus
echo "Trying w/o the patch..."
/usr/bin/time sh -c '
	while read commit previous
	do
		$M/git-apply --whitespace=nowarn --index "$commit-$previous"
		$M/git-write-tree
	done <series
' >out-1
tree1=`git write-tree`

git reset --hard linus
echo "Trying w/ the patch..."
$J/git-write-tree
/usr/bin/time sh -c '
	while read commit previous
	do
		$J/git-apply --whitespace=nowarn --index "$commit-$previous"
		$J/git-write-tree
	done <series
' >out-2
tree2=`git write-tree`
test "$tree1" = "$tree2" || exit 1
cmp out-1 out-2

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* 50% speed-up of "git-apply && git-write-tree" sequence.
  2006-04-24  3:03   ` [PATCH 5/4] Minimum fixups to cache-tree Junio C Hamano
@ 2006-04-24  5:41     ` Junio C Hamano
  2006-04-24 21:31       ` maintenance of cache-tree data Junio C Hamano
  0 siblings, 1 reply; 8+ messages in thread
From: Junio C Hamano @ 2006-04-24  5:41 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

I've fixed-up the series and they are now sitting near the tip
of "pu" branch tonight.  With some more testing I am hoping to
merge it in "next" soon.

I should recap the overall idea for general public here.

When applying a series of patches to your working tree and
making commits out of the results, what happens is roughly this:

 1. Your index file exactly matches your HEAD commit; you may
    have local modifications to some paths in your working tree
    as long as they do not interfere with the patch application.

 2. "git-apply --index" reads the index file, reads the patch,
    applies them to both index and the working tree, and writes
    the updated index file out.

 3. "git-write-tree" reads the index file, writes the index
    entries out as a set of hierarchical tree objects, the top
    of which is wrapped in a new commit object that has the
    current HEAD as the parent, and the commit becomes the new
    HEAD.

You repeat 1-3 as many times as you apply patches.

When dealing with a huge tree, with a typical patch touching
only a handful paths, this is not very efficient.  A recent
kernel source tree has about 1200 directories, each of which
needs to be made into the "tree" object format in-core from the
data in the index file, and we hash it to see if we already have
that tree, and if not we write it out after compression (if we
do have it, we do not do anything after checking).

The updated code does this:

 (1) When git-write-tree writes out trees from the index, we
     store <directory, tree SHA1> pair for all the trees we
     compute, in .git/index.aux file.

 (2) When git-write-tree starts up, it reads the index file, and
     also reads the .git/index.aux file.  The latter file
     contains a fingerprint of the former, so if the index was
     updated without updating the index.aux, what is read is
     discarded.  However, if .git/index.aux is up-to-date, what
     it records can be reused when we scan the index.

 (3) There are a few commands that updates the index file.
     git-update-index is the most well-known one, but read-tree
     and apply also update it.

     No command other than apply updates index.aux when writing
     out index (yet).  However, git-apply has been told about
     it.  It reads index and index.aux file, and when it updates
     an index entry, it _invalidates_ the directory information
     it read from index.aux file, and writes the updated
     index.aux file out when it is done.

     Note.  Earlier I sent out a fix ([PATCH 5/4]) that
     contained a two-line removal from apply.c; the code was
     doing more than just invalidating entries, but actually
     recomputing the tree.  The fixed code does not hash tree
     information when git-apply writes index out.

 (4) When git-write-tree starts up after git-apply finishes its
     job, it notices index.aux file matches the fingerprint of
     the index (so it is up-to-date) and some entries have been
     _invalidated_.  When it writes out the index, the parts of
     the directory hierarchy that have not been invalidated do
     not have to be formatted in-core into "tree" -- we already
     know what tree object they are.  Only the parts that were
     invalidated need to be computed.

For this, a new data structure "cache_tree" is introduced.  The
way to use the API is very simple.  You initialize it like this:

	struct cache_tree *active_cache_tree;
	unsigned char sha1[20];
	read_cache_1(sha1);
        active_cache_tree = read_cache_tree(sha1);

The new function read_cache_1() is similar to the traditional
read_cache(), but it gives back the index file fingerprint.
This is given to read_cache_tree() to make sure index.aux is not
stale.  Then, when you touch a path in the cache, you invalidate
the paths in the cache_tree with:

	cache_tree_invalidate_path(active_cache_tree, path);

For example, if you are updating an index entry
"fs/ext3/inode.c", it invalidates the cached tree SHA1
information for top-level tree, "fs" tree and "fs/ext3" tree.
All other information you read from index.aux are still valid
(IOW, three out of 1200 are invalidated).

When you are done, you write out the index file and the updated
index.aux file like this:

        if (write_cache_1(newfd, active_cache, active_nr, active_cache_sha1) ||
            commit_index_file(&cache_file))
                die("Unable to write new cachefile");
        write_cache_tree(active_cache_sha1, active_cache_tree);

Again, write_cache_1() has one extra parameter to receive the
updated index file fingerprint, and you call write_cache_tree()
with it.  It updates index.aux file (with invalidated entries).

One thing write-tree does different is this (note that it does
not update the index file, it just reads from it):

	if (cache_tree_update(active_cache_tree, active_cache, active_nr,
			      missing_ok))
		die("git-write-tree: error building trees");
	write_cache_tree(active_cache_sha1, active_cache_tree);

The call to cache_tree_update() takes the (possibly partially
invalidated) cache_tree, fills in the invalidated parts from the
index file (this involves creation of new "tree" objects as
needed).  So the above call to write_cache_tree() writes a
complete index.aux file out without any invalidated entries.

Obviously, the change similar to what I did to apply.c in this
series can be done to other index-updating commands.  That's
"low hanging fruit" for the week ;-).

^ permalink raw reply	[flat|nested] 8+ messages in thread

* maintenance of cache-tree data
  2006-04-24  5:41     ` 50% speed-up of "git-apply && git-write-tree" sequence Junio C Hamano
@ 2006-04-24 21:31       ` Junio C Hamano
  2006-04-24 22:34         ` Junio C Hamano
  0 siblings, 1 reply; 8+ messages in thread
From: Junio C Hamano @ 2006-04-24 21:31 UTC (permalink / raw)
  To: git; +Cc: Linus Torvalds

Junio C Hamano <junkio@cox.net> writes:

>  (1) When git-write-tree writes out trees from the index, we
>      store <directory, tree SHA1> pair for all the trees we
>      compute, in .git/index.aux file.

There are two reasons I did not make this extra information part
of the index file.

The number one reason was because the current index file format
is pretty dense, and I did not find an obvious hole in the
front, in the middle or at the tail to sneak extra data in
without upsetting existing code and without updating index file
version.  If this were a change to add a great new feature, that
might have warranted bumping the version up, but the cache-tree
is an optimization and if you lose that information, all you
have lost is that now your write-tree needs to recompute the
whole tree as before (IOW, not much).

The second reason was I wanted to do this step-by-step, and
wanted to do a demonstration of an end-to-end workflow that
gains from this set of changes (apply followed by write-tree
was an obvious minimum set of commands), and while doing so I
did not have to disrupt other commands that are unaware of this
extension.

Having said that, cache-tree.c has an internal interface to
serialize the cache-tree data in-core, primarily because I was
unsure if I wanted to append this information somewhere inside
the main index file or have it external when I did it; so we
could later push it into the index if we wanted to.

Ideally, everybody who writes index should be converted to
update cache-tree to help eventual write-tree.  Right now, if a
command updates index without updating a matching cache-tree,
the whole thing is invalidated; this way, you do not risk using
stale "cached" data.

Currently the command that primes the cache-tree is write-tree.
This may be counterintuitive -- even I myself would expect that
read-tree would prime it, and various index-updaters invalidate
subtrees they touch, and write-tree to use the surviving parts
to speed up what it needs to do, and write out an updated,
fully-vaild cache-tree.  I did not do it only because it was not
necessary for "apply then write-tree" cycle, but read-tree
should be taught about cache-tree to help others, _and_ to help
itself.

When "read-tree -m O A B" merges three trees, we iterate over
all index entries, even when only a small part of the tree has
changed.  This could be helped in a big way if the current index
has valid cache-tree information for the parts unaffected by the
merge.  If all three trees have identical tree in higher level
subdirectory (e.g. "fs/" in the kernel source), and if the index
has not touched anything in "fs/" since it read from our tree
(i.e. "A"), then we do not even have to descend into that
directory in the working tree to see which index entries are
dirty.  We can just keep index entries that begin with "fs/"
intact, keep the cache-tree entry for that directory as it was
read from "A".  This would be a big win -- we do not have to
read tree objects under "fs/" (there are 62 trees under "fs/",
so we save uncompressing and undeltifying 180 objects).  This
operation would need to invalidate entries in cache-tree that
are involved in the actual merge.  Branch-switching two-tree
form "read-tree -m OLD NEW" probably can benefit from the same
optimization.

Obviously, a single-tree form of read-tree should be able to
prime the cache-tree with fully valid data before writing the
index out.

Having said that, I am not touching read-tree myself for now; I
am lazy.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: maintenance of cache-tree data
  2006-04-24 21:31       ` maintenance of cache-tree data Junio C Hamano
@ 2006-04-24 22:34         ` Junio C Hamano
  2006-04-25 23:05           ` Junio C Hamano
  0 siblings, 1 reply; 8+ messages in thread
From: Junio C Hamano @ 2006-04-24 22:34 UTC (permalink / raw)
  To: git; +Cc: Linus Torvalds

Junio C Hamano <junkio@cox.net> writes:

> The number one reason was because the current index file format
> is pretty dense, and I did not find an obvious hole in the
> front, in the middle or at the tail to sneak extra data in
> without upsetting existing code and without updating index file
> version.

Well, I was blind ;-).  As long as the whole-file SHA1 matches,
read_cache() does not care if we have extra data after the
series of active_nr cache entry data in the index file.

I'm working on a patch now.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: maintenance of cache-tree data
  2006-04-24 22:34         ` Junio C Hamano
@ 2006-04-25 23:05           ` Junio C Hamano
  0 siblings, 0 replies; 8+ messages in thread
From: Junio C Hamano @ 2006-04-25 23:05 UTC (permalink / raw)
  To: git; +Cc: Linus Torvalds

Junio C Hamano <junkio@cox.net> writes:

> Well, I was blind ;-).  As long as the whole-file SHA1 matches,
> read_cache() does not care if we have extra data after the
> series of active_nr cache entry data in the index file.
>
> I'm working on a patch now.

So I did.

There is one bad thing; so far "write-tree" was a read-only
consumer of the index file, but now it primes the cache-tree
structure and needs to update the index.  But that is minor.

While I was at it, I made this "stuffing extra cruft in the
index" slightly more generic than I needed it for this
particular application.  What I see this _might_ be useful for
are:

 - We would want to store which commit of a subproject a
   particular subdirectory came from.  This was one missing
   piece from the "bind commit" proposal that wasn't implemented
   in the jc/bind branch.

 - We might want to record "at this path there is a directory,
   albeit empty"; this cannot be expressed with an usual index
   entry.

   We might be able to use cache-tree for that, but I think this
   is something different at the logical level.  While
   cache-tree is to be fully populated (by write-tree and
   perhaps read-tree later) and invalidated partially when
   update-index and friends smudge part of the tree, this is not
   something we would want to even invalidate (IOW, it should
   always be up-to-date), so they serve different purposes.

I still haven't looked at the read-tree yet, but as I outlined
in a previous message, its intra-index merge could take
advantage of cache-tree.  "diff-index", especially "--cached"
kind, also could use it to skip unchanged subtrees altogether.

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2006-04-25 23:06 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-04-23 23:52 [PATCH 4/4] Invalidate cache-tree entries for touched paths in git-apply Junio C Hamano
2006-04-24  1:25 ` Junio C Hamano
2006-04-24  2:47   ` Junio C Hamano
2006-04-24  3:03   ` [PATCH 5/4] Minimum fixups to cache-tree Junio C Hamano
2006-04-24  5:41     ` 50% speed-up of "git-apply && git-write-tree" sequence Junio C Hamano
2006-04-24 21:31       ` maintenance of cache-tree data Junio C Hamano
2006-04-24 22:34         ` Junio C Hamano
2006-04-25 23:05           ` Junio C Hamano

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).