git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] cogito: Avoid slowness when timewarping large trees.
@ 2006-03-24  8:44 Jeff King
  2006-03-24 10:21 ` Junio C Hamano
  0 siblings, 1 reply; 10+ messages in thread
From: Jeff King @ 2006-03-24  8:44 UTC (permalink / raw)
  To: git


tree_timewarp was calling read, egrep, and rm in an O(N) loop where N is
the number of changed files between two trees. This caused a bottleneck
when seeking/switching/merging between trees with many changed files.

On the historical linux tree, the time to cg-seek from the head to the
initial commit (a change of 19099 files) dropped from 2m35s to 21s.

---

 cg-Xlib |    9 +++------
 1 files changed, 3 insertions(+), 6 deletions(-)

a9a160c0bd63973c53ba3aa74650728135d23ac7
diff --git a/cg-Xlib b/cg-Xlib
index a2f28cf..ceddeeb 100644
--- a/cg-Xlib
+++ b/cg-Xlib
@@ -345,12 +345,9 @@ tree_timewarp()
 
 	# Kill gone files
 	git-diff-tree -r "$base" "$branch" |
-		while IFS=$'\t' read header file; do
-			# match ":100755 000000 14d43b1abf... 000000000... D"
-			if echo "$header" | egrep "^:([^ ][^ ]* ){4}D" >/dev/null; then
-				rm -- "$file"
-			fi
-		done
+		# match ":100755 000000 14d43b1abf... 000000000... D"
+		sed -ne 's/^:\([^ ][^ ]* \)\{4\}D\t//p' |
+		xargs rm --
 	git-checkout-index -u -f -a
 
 	# FIXME: Can produce bogus "contains only garbage" messages.
-- 
1.2.4

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH] cogito: Avoid slowness when timewarping large trees.
  2006-03-24  8:44 [PATCH] cogito: Avoid slowness when timewarping large trees Jeff King
@ 2006-03-24 10:21 ` Junio C Hamano
  2006-03-24 10:55   ` Jeff King
  0 siblings, 1 reply; 10+ messages in thread
From: Junio C Hamano @ 2006-03-24 10:21 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Jeff King <peff@peff.net> writes:

> tree_timewarp was calling read, egrep, and rm in an O(N) loop where N is
> the number of changed files between two trees. This caused a bottleneck
> when seeking/switching/merging between trees with many changed files.
>...
> ---
>
>  cg-Xlib |    9 +++------
>  1 files changed, 3 insertions(+), 6 deletions(-)
>
> a9a160c0bd63973c53ba3aa74650728135d23ac7
> diff --git a/cg-Xlib b/cg-Xlib
> index a2f28cf..ceddeeb 100644
> --- a/cg-Xlib
> +++ b/cg-Xlib
> @@ -345,12 +345,9 @@ tree_timewarp()
>  
>  	# Kill gone files
>  	git-diff-tree -r "$base" "$branch" |
> ...
> +		# match ":100755 000000 14d43b1abf... 000000000... D"
> +		sed -ne 's/^:\([^ ][^ ]* \)\{4\}D\t//p' |
> +		xargs rm --
>  	git-checkout-index -u -f -a

Metainformation fields are internally separated with SP and a
TAB comes before pathname; you can just say:

	sed -ne 's/^:[^	]* D	//p'

there (whitespace inside [] and after D are TAB; one before D is
a SP).  You might also want to consider "xargs rm -f --", BTW.

However, I wonder why it does not do this instead:

	... stash away the local changes
	git-read-tree -m "$base" ;# reset the index to $base

	# switch to $branch -- removing gone files as well
	git-read-tree -m -u "$base" "$branch"

Then you can also lose diff-tree and checkout-index there.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] cogito: Avoid slowness when timewarping large trees.
  2006-03-24 10:21 ` Junio C Hamano
@ 2006-03-24 10:55   ` Jeff King
  2006-03-24 11:01     ` Junio C Hamano
  0 siblings, 1 reply; 10+ messages in thread
From: Jeff King @ 2006-03-24 10:55 UTC (permalink / raw)
  To: git

On Fri, Mar 24, 2006 at 02:21:25AM -0800, Junio C Hamano wrote:

> Metainformation fields are internally separated with SP and a
> TAB comes before pathname; you can just say:
> 
> 	sed -ne 's/^:[^	]* D	//p'

That is much cleaner (I stupidly just converted the original regex
verbatim).

> a SP).  You might also want to consider "xargs rm -f --", BTW.

Oops, you're right. In particular, rm complains when there are no
deletions. 

> However, I wonder why it does not do this instead:
> 
> 	... stash away the local changes
> 	git-read-tree -m "$base" ;# reset the index to $base
> 
> 	# switch to $branch -- removing gone files as well
> 	git-read-tree -m -u "$base" "$branch"
> 
> Then you can also lose diff-tree and checkout-index there.

This doesn't deal very well with local changes. The second read-tree
complains about a not uptodate entry during the merge. Since we've
already stashed the local changes as a diff, we should be able to simply
ignore them during the read-tree. Should the first read-tree actually
be:
  git-read-tree --reset "$base"
?

-Peff

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] cogito: Avoid slowness when timewarping large trees.
  2006-03-24 10:55   ` Jeff King
@ 2006-03-24 11:01     ` Junio C Hamano
  2006-03-24 11:22       ` Jeff King
  0 siblings, 1 reply; 10+ messages in thread
From: Junio C Hamano @ 2006-03-24 11:01 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Jeff King <peff@peff.net> writes:

> ... Should the first read-tree actually
> be:
>   git-read-tree --reset "$base"

Exactly.  That's what I meant.  Thanks.

Originally I wrote "git reset" there, but this being Cogito I
thought Pasky preferred to use the true Plumbing and I botched
it X-<.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] cogito: Avoid slowness when timewarping large trees.
  2006-03-24 11:01     ` Junio C Hamano
@ 2006-03-24 11:22       ` Jeff King
  2006-03-24 16:43         ` Shawn Pearce
  0 siblings, 1 reply; 10+ messages in thread
From: Jeff King @ 2006-03-24 11:22 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Fri, Mar 24, 2006 at 03:01:35AM -0800, Junio C Hamano wrote:

> >   git-read-tree --reset "$base"
> Exactly.  That's what I meant.  Thanks.

Hmm. That doesn't actually work, though. If I have a history like this:

$ cg-init -m "initial"
$ cg-tag initial
$ echo contents >file
$ cg-add file
$ cg-commit -m "added file"

and I try this:
$ echo changes >file
$ git-read-tree --reset master
$ git-read-tree -m -u master initial

I get this:
fatal: Entry 'file' not uptodate. Cannot merge.

If I do an update-index before the second read-tree, then I simply get:
fatal: Entry 'file' would be overwritten by merge. Cannot merge.

Is there something I'm missing, or is a 'git reset --hard' really what
we want here (in that case, the fact that git reset changes the HEAD
might be a problem)?

-Peff

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] cogito: Avoid slowness when timewarping large trees.
  2006-03-24 11:22       ` Jeff King
@ 2006-03-24 16:43         ` Shawn Pearce
  2006-03-25  9:36           ` Jeff King
  0 siblings, 1 reply; 10+ messages in thread
From: Shawn Pearce @ 2006-03-24 16:43 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git

Jeff King <peff@peff.net> wrote:
> On Fri, Mar 24, 2006 at 03:01:35AM -0800, Junio C Hamano wrote:
> 
> > >   git-read-tree --reset "$base"
> > Exactly.  That's what I meant.  Thanks.
> 
> Hmm. That doesn't actually work, though. If I have a history like this:
> 
> $ cg-init -m "initial"
> $ cg-tag initial
> $ echo contents >file
> $ cg-add file
> $ cg-commit -m "added file"
> 
> and I try this:
> $ echo changes >file
> $ git-read-tree --reset master
> $ git-read-tree -m -u master initial
> 
> I get this:
> fatal: Entry 'file' not uptodate. Cannot merge.
> 
> If I do an update-index before the second read-tree, then I simply get:
> fatal: Entry 'file' would be overwritten by merge. Cannot merge.
> 
> Is there something I'm missing, or is a 'git reset --hard' really what
> we want here (in that case, the fact that git reset changes the HEAD
> might be a problem)?


This is sort of what I'm doing in pg-reset-tree, which is kind
of like 'git-reset --hard' but I think it is faster when $force
is unset:

  # Remove files left over from merge conflicts and files which are
  # somehow modified.  If this makes a directory empty it may have
  # been a new directory so delete that too.
  #
  (git-ls-files -z \
    --others \
    --ignored \
    --exclude='*#1' \
    --exclude='*#2' \
    --exclude='*#3' \
    --exclude='*.rej'
   git-diff-index --name-only -z HEAD
  ) | perl -n0e 'chomp; unlink; 1 while (s,/[^/]*$,, && rmdir)'
  
  # Rebuild the index and working directory.  We'll only checkout the
  # files which don't exist.  This resets the modified files we deleted
  # just above; remaining files will have their stat information updated
  # in the index.
  #
  git-read-tree --reset HEAD &&
  git-checkout-index --index --all $force \
    || die "Can't reset index and working directory."

  # Now that the working directory is clean we can safely merge it to
  # to our target tree, $new_base.
  #
  git-read-tree -m -u HEAD $new_base

The $force in git-checkout-index may or may not be set to
'--force'; its usually not set as its not usually necessary.
Unfortunately I've got a case where I'm mounting a directory
exported by SAMBA onto a Windows 2000 system and if I don't include
--force during git-checkout-index it doesn't work right about 1/3
of the time.  (It appears to be bad stat information coming from
Cygwin/Windows/SAMBA/Solaris.)

You can't skip the git-checkout-index step (I've tried) as the
ls-files/diff-index above causes the modified files (in your test
above 'changes') to disappear from the working directory and the
read-tree may not bring it back.

Now that I think about it isn't this sort of where you were before
in cg-seek?

-- 
Shawn.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] cogito: Avoid slowness when timewarping large trees.
  2006-03-24 16:43         ` Shawn Pearce
@ 2006-03-25  9:36           ` Jeff King
  2006-03-25  9:39             ` [PATCH] " Jeff King
  2006-03-25 19:55             ` [PATCH] cogito: " Junio C Hamano
  0 siblings, 2 replies; 10+ messages in thread
From: Jeff King @ 2006-03-25  9:36 UTC (permalink / raw)
  To: git

On Fri, Mar 24, 2006 at 11:43:52AM -0500, Shawn Pearce wrote:

> Now that I think about it isn't this sort of where you were before
> in cg-seek?

Yes, that's basically it. Short of Junio explaining how the manual file
removal can be avoided, I think my original patch should be applied, as
it causes an order of magnitude speed up. I will repost the cleaned-up
version.

-Peff

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH] Avoid slowness when timewarping large trees.
  2006-03-25  9:36           ` Jeff King
@ 2006-03-25  9:39             ` Jeff King
  2006-03-26 18:05               ` Petr Baudis
  2006-03-25 19:55             ` [PATCH] cogito: " Junio C Hamano
  1 sibling, 1 reply; 10+ messages in thread
From: Jeff King @ 2006-03-25  9:39 UTC (permalink / raw)
  To: git; +Cc: pasky


tree_timewarp was calling read, egrep, and rm in an O(N) loop where N is
the number of changed files between two trees. This caused a bottleneck
when seeking/switching/merging between trees with many changed files.

Signed-off-by: Jeff King <peff@peff.net>


---

This is a repost of the initial patch featuring a few cleanups suggested
by Junio. 

 cg-Xlib |    9 +++------
 1 files changed, 3 insertions(+), 6 deletions(-)

5f79b37a0eb85ff4f643e70a7f2823e68e9d9ca4
diff --git a/cg-Xlib b/cg-Xlib
index 5896df7..1a9bd4f 100644
--- a/cg-Xlib
+++ b/cg-Xlib
@@ -363,12 +363,9 @@ tree_timewarp()
 
 	# Kill gone files
 	git-diff-tree -r "$base" "$branch" |
-		while IFS=$'\t' read header file; do
-			# match ":100755 000000 14d43b1abf... 000000000... D"
-			if echo "$header" | egrep "^:([^ ][^ ]* ){4}D" >/dev/null; then
-				rm -- "$file"
-			fi
-		done
+		# match ":100755 000000 14d43b1abf... 000000000... D"
+		sed -ne 's/^:[^\t]* D\t//p' |
+		xargs rm -f --
 	git-checkout-index -u -f -a
 
 	# FIXME: Can produce bogus "contains only garbage" messages.
-- 
1.2.4

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH] cogito: Avoid slowness when timewarping large trees.
  2006-03-25  9:36           ` Jeff King
  2006-03-25  9:39             ` [PATCH] " Jeff King
@ 2006-03-25 19:55             ` Junio C Hamano
  1 sibling, 0 replies; 10+ messages in thread
From: Junio C Hamano @ 2006-03-25 19:55 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Jeff King <peff@peff.net> writes:

> On Fri, Mar 24, 2006 at 11:43:52AM -0500, Shawn Pearce wrote:
>
>> Now that I think about it isn't this sort of where you were before
>> in cg-seek?
>
> Yes, that's basically it. Short of Junio explaining how the manual file
> removal can be avoided,...

Please don't let me slow you down -- that "git-read-tree --reset"
one was just confused.  I was not thinking clearly about the
modification you already had in the working tree which you
wanted to temporarily reset while switching.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Avoid slowness when timewarping large trees.
  2006-03-25  9:39             ` [PATCH] " Jeff King
@ 2006-03-26 18:05               ` Petr Baudis
  0 siblings, 0 replies; 10+ messages in thread
From: Petr Baudis @ 2006-03-26 18:05 UTC (permalink / raw)
  To: git

Dear diary, on Sat, Mar 25, 2006 at 10:39:57AM CET, I got a letter
where Jeff King <peff@peff.net> said that...
> tree_timewarp was calling read, egrep, and rm in an O(N) loop where N is
> the number of changed files between two trees. This caused a bottleneck
> when seeking/switching/merging between trees with many changed files.
> 
> Signed-off-by: Jeff King <peff@peff.net>

Thanks, applied.

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
Right now I am having amnesia and deja-vu at the same time.  I think
I have forgotten this before.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2006-03-26 18:05 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-03-24  8:44 [PATCH] cogito: Avoid slowness when timewarping large trees Jeff King
2006-03-24 10:21 ` Junio C Hamano
2006-03-24 10:55   ` Jeff King
2006-03-24 11:01     ` Junio C Hamano
2006-03-24 11:22       ` Jeff King
2006-03-24 16:43         ` Shawn Pearce
2006-03-25  9:36           ` Jeff King
2006-03-25  9:39             ` [PATCH] " Jeff King
2006-03-26 18:05               ` Petr Baudis
2006-03-25 19:55             ` [PATCH] cogito: " Junio C Hamano

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).