Git development
 help / color / mirror / Atom feed
* Re: [RFC] Cache negative delta pairs
From: Nicolas Pitre @ 2006-06-29 16:35 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git
In-Reply-To: <Pine.LNX.4.64.0606291053280.1213@localhost.localdomain>

On Thu, 29 Jun 2006, Nicolas Pitre wrote:

> And this can be pushed even further by just including the sha1 of the 
> victim object inside the list of objects therefore computing a hash of 
> all objects (the victim and the window) for which no delta results. The 
> cache is therefore a list of hash values corresponding to bad 
> victim+window combinations.
> 
> So given my GIT repository such a cache would be 7610 * 40 = 304400 
> bytes if we stick to the full 40 bytes of sha1 to hash bad combinations.

Correction: the 40 bytes figure is for _ascii_ representation of sha1 
values.  The cache doesn't need ascii and therefore this number can be 
reduced by half.


Nicolas

^ permalink raw reply

* [PATCH] Allow INSTALL, bindir, mandir to be set in main Makefile
From: Jakub Narebski @ 2006-06-29 16:35 UTC (permalink / raw)
  To: git
In-Reply-To: <200606291704.27677.jnareb@gmail.com>

Makefiles in subdirectories now use existing value of INSTALL, bindir,
mandir if it is set, allowing those to be set in main Makefile or in
config.mak.  Main Makefile exports variables which it sets.

Renames man1 and man7 variables to man1dir and man7dir, according to
"Makefile Conventions: Variables for Installation Directories" in
make.info of GNU Make.  Renames bin to bindir (unused, perhaps to be
removed).

Signed-off-by: Jakub Narebski <jnareb@gmail.com>
---
Part of autoconf series, but independent.

Should probably be split into two patches:
 * first with export + '?='
 * second renaming man1 and man7 to man1dir and man7dir

 Documentation/Makefile   |   14 +++++++-------
 Makefile                 |    2 ++
 contrib/emacs/Makefile   |    4 ++--
 contrib/git-svn/Makefile |    8 ++++----
 4 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/Documentation/Makefile b/Documentation/Makefile
index 2b0efe7..cc83610 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -25,10 +25,10 @@ DOC_MAN1=$(patsubst %.txt,%.1,$(MAN1_TXT
 DOC_MAN7=$(patsubst %.txt,%.7,$(MAN7_TXT))
 
 prefix?=$(HOME)
-bin=$(prefix)/bin
-mandir=$(prefix)/man
-man1=$(mandir)/man1
-man7=$(mandir)/man7
+bindir?=$(prefix)/bin
+mandir?=$(prefix)/man
+man1dir=$(mandir)/man1
+man7dir=$(mandir)/man7
 # DESTDIR=
 
 INSTALL?=install
@@ -52,9 +52,9 @@ man1: $(DOC_MAN1)
 man7: $(DOC_MAN7)
 
 install: man
-	$(INSTALL) -d -m755 $(DESTDIR)$(man1) $(DESTDIR)$(man7)
-	$(INSTALL) $(DOC_MAN1) $(DESTDIR)$(man1)
-	$(INSTALL) $(DOC_MAN7) $(DESTDIR)$(man7)
+	$(INSTALL) -d -m755 $(DESTDIR)$(man1dir) $(DESTDIR)$(man7dir)
+	$(INSTALL) $(DOC_MAN1) $(DESTDIR)$(man1dir)
+	$(INSTALL) $(DOC_MAN7) $(DESTDIR)$(man7dir)
 
 
 #
diff --git a/Makefile b/Makefile
index cde619c..b8fe669 100644
--- a/Makefile
+++ b/Makefile
@@ -100,6 +100,8 @@ template_dir = $(prefix)/share/git-core/
 GIT_PYTHON_DIR = $(prefix)/share/git-core/python
 # DESTDIR=
 
+export prefix bindir gitexecdir template_dir GIT_PYTHON_DIR
+
 CC = gcc
 AR = ar
 TAR = tar
diff --git a/contrib/emacs/Makefile b/contrib/emacs/Makefile
index d3619db..350846d 100644
--- a/contrib/emacs/Makefile
+++ b/contrib/emacs/Makefile
@@ -3,9 +3,9 @@ ## Build and install stuff
 EMACS = emacs
 
 ELC = git.elc vc-git.elc
-INSTALL = install
+INSTALL ?= install
 INSTALL_ELC = $(INSTALL) -m 644
-prefix = $(HOME)
+prefix ?= $(HOME)
 emacsdir = $(prefix)/share/emacs/site-lisp
 
 all: $(ELC)
diff --git a/contrib/git-svn/Makefile b/contrib/git-svn/Makefile
index 7c20946..8cac688 100644
--- a/contrib/git-svn/Makefile
+++ b/contrib/git-svn/Makefile
@@ -1,9 +1,9 @@
 all: git-svn
 
 prefix?=$(HOME)
-bindir=$(prefix)/bin
-mandir=$(prefix)/man
-man1=$(mandir)/man1
+bindir?=$(prefix)/bin
+mandir?=$(prefix)/man
+man1dir=$(mandir)/man1
 INSTALL?=install
 doc_conf=../../Documentation/asciidoc.conf
 -include ../../config.mak
@@ -17,7 +17,7 @@ install: all
 	$(INSTALL) git-svn $(DESTDIR)$(bindir)
 
 install-doc: doc
-	$(INSTALL) git-svn.1 $(DESTDIR)$(man1)
+	$(INSTALL) git-svn.1 $(DESTDIR)$(man1dir)
 
 doc: git-svn.1
 git-svn.1 : git-svn.xml
-- 
1.4.0

^ permalink raw reply related

* Re: [RFC] Cache negative delta pairs
From: Nicolas Pitre @ 2006-06-29 16:39 UTC (permalink / raw)
  To: Jeff King; +Cc: git
In-Reply-To: <20060629035849.GA30749@coredump.intra.peff.net>

On Wed, 28 Jun 2006, Jeff King wrote:

> >From repack to repack, we end up trying to delta many of the same object
> pairs, which is computationally expensive.  This patch makes a
> persistent cache, $GIT_DIR/delta-cache, which contains pairs of sha1
> hashes (the presence of a pair indicates that it's not worth trying to
> delta those two objects).

I think this could be done differently to be even more space efficient 
as I suggested earlier.

> Here are some of my thoughts:
> 
>  - speed. The implementation is quite fast. The sha1 pairs are stored
>    sorted, and we mmap and binary search them. Certainly the extra time
>    spent in lookup is justified by avoiding the delta attempts.

You do that lookup for every delta match attempt.  Instead it could be 
done once for the whole window attempt, potentially reducing the cache 
size by a factor of 20, and it might be faster too.

>  - size. The cache is a packed sequence of binary sha1 pairs. I was
>    concerned that it would grow too large (obviously for n blobs you can
>    end up with n^2/2 entries), but it doesn't seem unreasonable for most
>    repos (either you don't have a lot of files, or if you do, they delta
>    reasonably well). My test repo's cache is only 144K. The git cache is
>    about 2.7M. The linux-2.6 cache is 22M.

See my previous email for comments about this.

>    Theoretically, I could bound the cache size and boot old entries.
>    However, that means storing age information, which increases the
>    required size. I think keeping it simple is best.

You could simply recreate the cache on each run.  Or just keep a bitmap 
of referenced cache entries, and a list of new entries.  At the end if 
the new entry list is empty and the bitmap is all set (or clear) then 
you just keep the current cache.  ONce, say, more than 10% of cache 
entries are not used anymore then you regenerate the cache.  But since 
you need to regenerate it when new entries are added then the 10% unused 
entry criteria might not be used that often anyway, unless packs shrink 
with time which might not be the case really often.

>  - correctness. Currently the code uses the output of try_delta for
>    negative caching. Should the cache checking be moved inside try_delta
>    instead? This would give more control over which reasons to mark a
>    delta negative (I believe right now hitting the depth limit will
>    cause a negative mark; we should perhaps only do so if the content
>    itself makes the delta unpalatable).

Like I said earlier I think this should be moved up a bit i.e. outside 
the delta loop.  In find_deltas() right before the ...

		j = window;
		while (--j > 0) {

loop, just insert another loop that compute the sha1 of the current 
target object and window:

		SHA1_Init(&c);
		j = window;
		while (j-- > 0) { /* postdec include the current object as well */
			unsigned int other_idx = idx + j;
			struct unpacked *m;
			if (other_idx >= window)
				other_idx -= window;
			m = array + other_idx;
			if (!m->entry)
				break;
			SHA1_Update(&c, m->entry.sha1, 20);
		}
		SHA1_Final(negative_delta_hash, &c);			

Then you can skip over the whole of the delta loop right away if that 
negative_delta_hash matches one entry in the cache since you already 
know that this object with this window doesn't produce any delta result.

Otherwise, after the delta loop has proceeded, if there is no delta 
found then you can add negative_delta_hash to the cache.

>  - optionalness. Currently the delta-cache is always used. Since it is a
>    space-time tradeoff, maybe it should be optional (it will have
>    negligible performance and horrible size impact on a repo that
>    consists of many very small but unrelated objects).  Possible methods
>    include:
>      - enable cache saves only if .git/delta-cache is present; turn it
>        on initially with 'touch .git/delta-cache'
>      - config variable

First, I think it should be ignored (but still created) when 
--no-reuse-delta is passed.  Then, it should not be created (but still 
looked up if it exists and --no-reuse-delta is not provided) when the 
pack index file is also not created.  I don't think it is worth making 
this further configurable, and given the suggested strategy above the 
cache should remain fairly small.


Nicolas

^ permalink raw reply

* Re: Improved three-way blob merging code
From: Linus Torvalds @ 2006-06-29 17:45 UTC (permalink / raw)
  To: Alex Riesen; +Cc: Junio C Hamano, Git Mailing List, Davide Libenzi
In-Reply-To: <81b0412b0606290043s15e19b9fl853627e815f009ff@mail.gmail.com>



On Thu, 29 Jun 2006, Alex Riesen wrote:
>
> On 6/29/06, Linus Torvalds <torvalds@osdl.org> wrote:
> > +static void *three_way_filemerge(mmfile_t *base, mmfile_t *our, mmfile_t
> > *their, unsigned long *size)
> > +{
> ...
> > +       if (t1 && t2 && t3) {
> > +               int code = run_command("merge", t2, t1, t3, NULL);
> 
> This does not use the labels of merge(1) and the merged file will contain
> the names of temporary files at conflict places, which is very confusing if
> you happen to loose context while doing a merge with lots of conflicts.

Yes. 

I was really really _really_ just hoping that I could do the nasty core 
code that others might feel uncomfortable with, and get it working from a 
_technical_ level well enough that others (hint hint) would decide that 
they can fix up the details.

Getting the first version working is often the hardest. When you reach the 
point of "it works, but I want to extend it to do X", you've usually 
already gotten pretty far.

Anyway, all of this was really just preparatory work to show what 
git-merge-tree does. A few more improvements to git-merge-tree, and 
hopefully it can start being useful - perhaps not initially for actually 
merging, but for doing a tree-level three-way diff between two branches.

In other words, my current goal is really do make it possible to get good 
diffs out of git from two branches that aren't directly related. You and 
Dscho seem to be doing well on the git-merge-recursive front, so my 
personal goal is actually to be able to get a saner diff than what

	git diff mine..theirs

gives you.

The above "git diff" is a perfectly fine thing to do, but it's usually not 
what you actually _want_. Almost always, what you want a diff between two 
branches, what you want is actually the diff after a merge of the 
branches, not the raw _current_ differences.

For example, look at our current 

	Documentation/howto/using-topic-branches.txt

file, and realize that the current scripts it suggests are actually 
broken:

	To create diffstat and shortlog summaries of changes to include in 
	a "pleasepull" request to Linus you can use:

	 $ git-whatchanged -p release ^linus | diffstat -p1
	and
	 $ git-whatchanged release ^linus | git-shortlog

where that "git-whatchanged -p release ^linus | diffstat -p1" won't 
actually be what I see when I merge (although it will hopefully be close 
enough). Also, there's no indication that the merge will fail when I pull, 
something that _would_ be very useful.

IOW, what I'd like git-merge-tree to do is to be able to at a minimum say:

 - will a merge succeed cleanly, and if not, show me where the problem 
   spots are.
 - what will the result of the merge look like.

because that's actually what a downstream developer wants to do. He'd just 
do

	git fetch linus
	git show-changes linus..my-branch

which would basically be the preparatory thing for sending me an email 
saying "please merge 'my-branch', and you'll see this".

Now, obviously, I think that there's a _lot_ of overlap between doing this 
and actually doing the merge itself, so hopefully the things I do will at 
least have some things in common and perhaps help you do the proper 
recursive merger.

But one thing I was actually hoping to do was to literally be able to do 
this without either tree being checked out, exactly so that you can check 
the status of a branch that you may not even _be_ on (eg that "my-branch" 
in the example above may not even be your current HEAD: you might be on 
your development branch when you actually ask me to pull from your stable 
branch).

So the "show differences" has a lot in common with "merge them", but there 
are literally a few stages missing. One thing missing in just showing 
differences is that you can't actually fix up the merge clashes: if you 
don't have a checked-out tree, you're not going to be able to do a real 
merge. But you can see what a merge would _look_ like, and whether there 
are any clashes that you will have to fix..

		Linus

^ permalink raw reply

* [PATCH] autoconf: Set mandir in config.mak.in and export variables not in Makefile
From: Jakub Narebski @ 2006-06-29 17:47 UTC (permalink / raw)
  To: git
In-Reply-To: <200606291835.53788.jnareb@gmail.com>

This patch with the previous "[PATCH] Allow INSTALL, bindir, mandir to
be set in main Makefile" patch allows ./configure script to set where
manpages will be installed using --mandir=DIR (./configure defaults to
PREFIX/man).

Signed-off-by: Jakub Narebski <jnareb@gmail.com>
---

Uwe Zeisberger wrote:
> autoconf does to much things, even with that little configure.ac.
> (But I agree, it's much better than automake :-)
>
> E.g.
>
>         ./configure --prefix=$HOME/usr --mandir=$HOME/usr/share/man
>
> is supported by the configure script, but the manpages are installed
> in $HOME/usr/man all the same.
>
> BTW: Even if I specify mandir=... in config.mak, it is not respected,
> because only the toplevel Makefile includes config.mak.  (I didn't
> test it, but I think I could export mandir in config.mak ...)

This patch and previous one adressess this.

 config.mak.in |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/config.mak.in b/config.mak.in
index 82d80e2..82c9781 100644
--- a/config.mak.in
+++ b/config.mak.in
@@ -8,5 +8,11 @@ #gitexecdir = @libexecdir@/git-core/
 template_dir = @datadir@/git-core/templates/
 GIT_PYTHON_DIR = @datadir@/git-core/python
 
+mandir=@mandir@
+
 srcdir = @srcdir@
 VPATH = @srcdir@
+
+export exec_prefix mandir
+export srcdir VPATH
+
-- 
1.4.0

^ permalink raw reply related

* Re: [RFC] Cache negative delta pairs
From: Jeff King @ 2006-06-29 18:00 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Junio C Hamano, git
In-Reply-To: <Pine.LNX.4.64.0606291053280.1213@localhost.localdomain>

On Thu, Jun 29, 2006 at 11:42:14AM -0400, Nicolas Pitre wrote:

> So the negative cache should not be O(N^2) either.  It just has to be 
> O(N*window).

Yes, unless we use the cache information to change the window (but I
don't think that's worth it, unless our window heuristics are bad).
The implementation I posted should already grow to O(N*window), since it
only saves what we try (and fail at).

> My past experiments showed that the best window size for compression is 
> a bit larger than the current default of 10.  It was rather around 15 
> for the kernel repository, with higher values than 15 not providing 
> significant improvements anymore.  But that is clearly a function of the 
> repository nature (the average number of revisions for each file).  But 
> the window size is directly connected to the computational cost.

Increasing the window, of course, has virtually no speed impact on later
repacks with the cache in effect. Packing with a window of 15 drops my
linux-2.6 pack to 108M from 122M. That helps offset the cache size
(though of course it takes longer on the initial pack).

> This is way suboptimal.  First there is no reason for the cache to ever 
> grow to N^2.  At worst it should be N*10 where 10 is the current window 
> size.

I assumed the window would change over time (though our total is still
likely to hang around N*10 rather than N^2).

> none of the objects found in a given window.  Therefore we only have to 
> compute a hash for the object names found in that window and store that 
> in the cache.  So the cache entries would then be a pair of sha1: first 
> the sha1 of the victim object, and the sha1 of all sha1 names for the 
> objects against which the victim object was found not to delta well 
> against.

This will fail to hit the cache anytime the window changes. How often
does the window change? In my test case, I would think anytime I added a
bunch of new photos, it would be likely that one of them would make it
into the window, thus invalidating the cache entry and forcing me to try
against every object in the window (even though I've already tried
9/10).

Also, is it true to say "if this object did not delta against this
window, it will never delta?" What about interactions with the depth
parameter?

> So given my GIT repository such a cache would be 7610 * 40 = 304400 
> bytes if we stick to the full 40 bytes of sha1 to hash bad combinations.

Keep in mind that it will grow every time the window changes.

-Peff

^ permalink raw reply

* Re: [RFC] Cache negative delta pairs
From: Jeff King @ 2006-06-29 18:07 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: git
In-Reply-To: <Pine.LNX.4.64.0606291154510.1213@localhost.localdomain>

On Thu, Jun 29, 2006 at 12:39:31PM -0400, Nicolas Pitre wrote:

> You do that lookup for every delta match attempt.  Instead it could be 
> done once for the whole window attempt, potentially reducing the cache 
> size by a factor of 20, and it might be faster too.

I'm not convinced this will provide good cache hit characteristics, and
I'm not convinced it's semantically correct (see my other mail).

> You could simply recreate the cache on each run.  Or just keep a bitmap 

Yes, that would probably work and would be quite easy to do with the
existing code.

> First, I think it should be ignored (but still created) when 
> --no-reuse-delta is passed.  Then, it should not be created (but still 
> looked up if it exists and --no-reuse-delta is not provided) when the 
> pack index file is also not created.  I don't think it is worth making 
> this further configurable, and given the suggested strategy above the 
> cache should remain fairly small.

Those suggestions make sense to me.

-Peff

^ permalink raw reply

* Re: [RFC] git --trace: trace command execution
From: Jakub Narebski @ 2006-06-29 18:06 UTC (permalink / raw)
  To: git
In-Reply-To: <7v3bdtv4h3.fsf@assigned-by-dhcp.cox.net>

Junio C Hamano wrote:

> By the way "git cat-file -p" or "git verify-tag -v" might be
> more pleasant to view a tag since they make the tagger timestamp
> human readable.

Interesting, -p makes tagger timestamp human readable, but not author or
commiter:

$ git cat-file -p `cat .git/refs/tags/v1.4.0`
tagger Junio C Hamano <junkio@cox.net> Sat Jun 10 12:43:37 2006 -0700

$ git cat-file -p `cat .git/refs/heads/origin`
author Johannes Schindelin <Johannes.Schindelin@gmx.de> 1151491527 +0200
committer Junio C Hamano <junkio@cox.net> 1151492136 -0700

Is it intended, or a bug/missing feature (git 1.4.0)?

-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git

^ permalink raw reply

* Re: xdiff: generate "anti-diffs" aka what is common to two files
From: Davide Libenzi @ 2006-06-29 18:18 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List
In-Reply-To: <Pine.LNX.4.64.0606282149060.12404@g5.osdl.org>

On Wed, 28 Jun 2006, Linus Torvalds wrote:

> Davide, I say it's a "fairly trivial patch", but quite frankly, I'm
> winging it. I'm not that comfortable with the libxdiff internal workings,
> so while this seems to work for me and make sense, can you please take a
> quick look.

Looks fine to me.



- Davide

^ permalink raw reply

* Re: Improved three-way blob merging code
From: Davide Libenzi @ 2006-06-29 18:21 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List
In-Reply-To: <Pine.LNX.4.64.0606282157210.12404@g5.osdl.org>

On Wed, 28 Jun 2006, Linus Torvalds wrote:

> It would be lovely if libxdiff did a 3-way merge on its own, but this
> basically approximates it within that three_way_filemerge() function
> using external functionality.

This is my todo-list, that unfortunately is pretty crowded nowadays. I'll 
come graveling you main window once I have it ;)



- Davide

^ permalink raw reply

* Re: [PATCH] Allow INSTALL, bindir, mandir to be set in main Makefile
From: Junio C Hamano @ 2006-06-29 18:23 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git
In-Reply-To: <200606291835.53788.jnareb@gmail.com>

Jakub Narebski <jnareb@gmail.com> writes:

> Part of autoconf series, but independent.

I'd like to take something like this, independently from
"optionally managing config.mak with autoconf" series.

> Should probably be split into two patches:
>  * first with export + '?='
>  * second renaming man1 and man7 to man1dir and man7dir

And I think it is probably a good idea to somehow keep people's
configurations that have been overriding man1 and man7 if
possible.  Otherwise things would regress for them.

^ permalink raw reply

* Re: [PATCH] autoconf: Use autoconf to write installation directories to config.mak
From: Junio C Hamano @ 2006-06-29 18:23 UTC (permalink / raw)
  To: Matthias Lederhofer; +Cc: git, Jakub Narebski
In-Reply-To: <E1FvvuX-0002Lr-Nt@moooo.ath.cx>

Matthias Lederhofer <matled@gmx.net> writes:

>> This is beginning of patch series introducing installation configuration
>> using autoconf (and no other autotools) to git. The idea is to generate
>> config.mak using ./configure (generated from configure.ac) from
>> config.mak.in, so one can use autoconf as an _alternative_ to ordinary
>> Makefile, and creating one's own config.mak.
>
> Are you sure this should be named config.mak? From INSTALL:
>> You can place local settings in config.mak and the Makefile
>> will include them.  Note that config.mak is not distributed;
>> the name is reserved for local settings.
>
> So with another filename either include it
> - before config.mak: the user may override ./configure options with
>   config.mak
> - after config.mak: ./configure overrides config.mak
> At least do not overwrite config.mak if it exists.

Think of it as an incredibly smart editor that you can use to
edit your config.mak.  As far as I understand it, this is an
"opt-in" way for you to manage config.mak without editing it
yourself, so I do not have much objections to the series in
principle.

I think it is fair to say that autoconf is not so bad compared
to its worse friends, but I am biased -- I used to deal with
autoconf quite a lot in old days (before there were automake nor
libtool), and my own changes to autoconf might still be
surviving in a few .m4 files but I doubt it it was so long ago.

Having said that, I am skeptical how well we can keep this
contained only to config.mak and keep it "opt-in", so I am not
enthused to have this series at the toplevel of the tree.

It would have been a bit easier to swallow if this whole
machinery to build config.mk were somewhere under contrib/ (say
in contrib/autoconf), with an instruction to make an "opt-in"
symlink "ln -s contrib/autoconf/config.mk config.mk" for people
who want to use it in the toplevel INSTALL file, perhaps.

The one that touches the Makefile to propagate some variables
down to submakes is probably a desirable change but that is
pretty much independent from the series.

^ permalink raw reply

* Re: [RFC] Cache negative delta pairs
From: Nicolas Pitre @ 2006-06-29 18:24 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git
In-Reply-To: <20060629180011.GA4392@coredump.intra.peff.net>

On Thu, 29 Jun 2006, Jeff King wrote:

> > This is way suboptimal.  First there is no reason for the cache to ever 
> > grow to N^2.  At worst it should be N*10 where 10 is the current window 
> > size.
> 
> I assumed the window would change over time (though our total is still
> likely to hang around N*10 rather than N^2).

It doesn't change unless you force a different window size.

> > none of the objects found in a given window.  Therefore we only have to 
> > compute a hash for the object names found in that window and store that 
> > in the cache.  So the cache entries would then be a pair of sha1: first 
> > the sha1 of the victim object, and the sha1 of all sha1 names for the 
> > objects against which the victim object was found not to delta well 
> > against.
> 
> This will fail to hit the cache anytime the window changes. How often
> does the window change? In my test case, I would think anytime I added a
> bunch of new photos, it would be likely that one of them would make it
> into the window, thus invalidating the cache entry and forcing me to try
> against every object in the window (even though I've already tried
> 9/10).

Sure.  But on the lot how often will that happen?
This trades a bit of CPU for much smaller cache which might be worth it.

And even then, since my suggested method implies only one cache lookup 
in a much smaller cache instead of 10 lookups in a larger cache for each 
objects it might end up faster overall even if sometimes some windows 
don't match and deltas are recomputed needlessly.

> Also, is it true to say "if this object did not delta against this
> window, it will never delta?" What about interactions with the depth
> parameter?

Of course a greater depth might allow for a hit where there isn't any 
otherwise.  But changing the delta depth is not something someone does 
that often, and when the depth is changed then you better use -f with 
git-repack as well which like I said should also ignore the cache.

> > So given my GIT repository such a cache would be 7610 * 40 = 304400 
> > bytes if we stick to the full 40 bytes of sha1 to hash bad combinations.
> 
> Keep in mind that it will grow every time the window changes.

What do you mean by window change?


Nicolas

^ permalink raw reply

* Re: [PATCH] move get_merge_bases() to core lib; use it in merge-recursive
From: Linus Torvalds @ 2006-06-29 18:26 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Alex Riesen, git, Junio C Hamano, Fredrik Kuivinen
In-Reply-To: <Pine.LNX.4.63.0606291814200.29667@wbgn013.biozentrum.uni-wuerzburg.de>



On Thu, 29 Jun 2006, Johannes Schindelin wrote:
> 
> My point being: it makes no sense to split off get_merge_bases() if nobody 
> uses it except for git-merge-base.

I disagree. Your get_merge_bases() function is another small step in 
cleanup and libification, so I think it's perfectly valid regardless of 
whether somebody uses it yet.

A lot of the libification effort ends up being preparatory, because often 
you need to get past a hump of libifying a _lot_ in order to actually use 
something in a library.

So at first, the only user is usually the old separate program. So you 
libify get_merge_bases, and initially the only user is git-merge-base, 
which could now be made a built-in, since the bulk of its code is going to 
be linked in regardless..

		Linus

^ permalink raw reply

* Re: [PATCH] move get_merge_bases() to core lib; use it in merge-recursive
From: Junio C Hamano @ 2006-06-29 18:39 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git
In-Reply-To: <Pine.LNX.4.63.0606291814200.29667@wbgn013.biozentrum.uni-wuerzburg.de>

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> My point being: it makes no sense to split off get_merge_bases() if nobody 
> uses it except for git-merge-base.

I do not think that is a good reasoning.  If something is
reusable (or you made it reusable) and you are planning to reuse
it later, splitting it out without changing anything else to
make sure the split is correct is a seemingly small but a very
important step.

^ permalink raw reply

* Re: CFT: merge-recursive in C (updated)
From: Johannes Schindelin @ 2006-06-29 18:40 UTC (permalink / raw)
  To: Alex Riesen; +Cc: Junio C Hamano, git, Fredrik Kuivinen, Linus Torvalds
In-Reply-To: <81b0412b0606290143g5a3a0f5atbda3f4250411e92e@mail.gmail.com>

Hi,

On Thu, 29 Jun 2006, Alex Riesen wrote:

> [cc list restored, I'm lost in the maze of git update-index, all cache
> changing functions looking almost the same]
> 
> On 6/29/06, Junio C Hamano <junkio@cox.net> wrote:
> > > this broke t6022-merge-rename.sh (the second test). It produces an
> > > index with this:
> > >
> > > .../t/trash$ git-diff-index white
> > > :100644 100644 2d603156dc5bdf6295c789cac08e3c9942a0b82a
> > 0000000000000000000000000000000000000000 M      B
> > > :100644 100644 ba41fb96393979b22691106b06bf5231eab57b85
> > 0000000000000000000000000000000000000000 M      N
> > >
> > > whereas git-merge-recursive (and the previous version, without pipe):
> > >
> > > .../t/trash$ git-diff-index white
> > > :100644 100644 2d603156dc5bdf6295c789cac08e3c9942a0b82a
> > 0000000000000000000000000000000000000000 M      B
> > >
> > > I can see that "git update-index --add" is somehow different from a
> > > pipe to "git update-index --index-info", but not very clear. Does this
> > > "zero-sha1" mean that the file "N" is not in the index?
> > 
> > When diff-index and diff-files compare a tree entry or an index
> > entry with a file in the working tree, they do not compute the
> > blob hash value for the file in the working tree.  0{40} is used
> > on the RHS in such a case.  When the working tree file matches
> > the corresponding index entry, then we know RHS matches what is
> > in the index, so both sides have the blob hash value.
> 
> Ok. Am I correct in the assumption, that if the file in working tree has
> the same SHA1 as LHS, than the next "git-update-index --refresh" will
> remove the entry from git-diff-index output?
> This is what actually happens, if I do "git-update-index --refresh", so I
> suspect that I have an SHA1 update gone missing somewhere.

I think the problem is more like this (in ce_match_stat_basic()):

        if (ce->ce_mtime.sec != htonl(st->st_mtime))
                changed |= MTIME_CHANGED;
        if (ce->ce_ctime.sec != htonl(st->st_ctime))
                changed |= CTIME_CHANGED;

If you update with --index-info, the mtime and ctime is not updated 
from the file in the working directory.

All the more a reason to go forward with direct calls to read_cache() and 
write_cache(). At the moment, my plan is

- trivially split the read_cache() code into read_cache()/read_cache_from(),
- introduce flush_cache(),
- trivially rewrite add_file_to_cache(), and add_cache_info() from
  builtin-update-index.c, move that to read-tree.c, too, and
- throw out the pipe to git-update-index from merge-recursive.c 
  altogether.

This should be less intrusive than it sounds: with the introduction of 
read_cache_from() it should be trivial to handle the problem of different 
index files.

We have to put a lock_file on the index file at the start, and write the 
index file at the end.

Ciao,
Dscho

^ permalink raw reply

* Re: [RFC] Cache negative delta pairs
From: Nicolas Pitre @ 2006-06-29 18:48 UTC (permalink / raw)
  To: Jeff King; +Cc: git
In-Reply-To: <20060629180719.GB4392@coredump.intra.peff.net>

On Thu, 29 Jun 2006, Jeff King wrote:

> On Thu, Jun 29, 2006 at 12:39:31PM -0400, Nicolas Pitre wrote:
> 
> > You do that lookup for every delta match attempt.  Instead it could be 
> > done once for the whole window attempt, potentially reducing the cache 
> > size by a factor of 20, and it might be faster too.
> 
> I'm not convinced this will provide good cache hit characteristics, and
> I'm not convinced it's semantically correct (see my other mail).

Dare to test it?  I provided you with most of the code difference 
already.

And what do you mean by "semantically correct"?


Nicolas

^ permalink raw reply

* Re: [RFC] Cache negative delta pairs
From: Jeff King @ 2006-06-29 18:53 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Junio C Hamano, git
In-Reply-To: <Pine.LNX.4.64.0606291410420.1213@localhost.localdomain>

On Thu, Jun 29, 2006 at 02:24:57PM -0400, Nicolas Pitre wrote:

> > I assumed the window would change over time (though our total is still
> > likely to hang around N*10 rather than N^2).
> It doesn't change unless you force a different window size.

Sorry, I meant "the items in the window for a given object would change
over time." 

> > This will fail to hit the cache anytime the window changes. How often
> > does the window change? In my test case, I would think anytime I added a
> > bunch of new photos, it would be likely that one of them would make it
> > into the window, thus invalidating the cache entry and forcing me to try
> > against every object in the window (even though I've already tried
> > 9/10).
> Sure.  But on the lot how often will that happen?

Reasonably often, according to my test. I did this to simulate usage
over time:
  - create an empty repo
  - from my test repo of 515 images, grab 20 at a time and add/commit
    them
  - after each commit, record the SHA1 of (object, window[0..n]) for
    each object to be delta'd
If doing the cache on the sha1 of the whole window is a good idea, then
we should see many of the same hashes from commit to commit. If we
don't, that means the newly added files are being placed in the old
windows, thus disrupting their hashes.

The results were that there was typically only 1 reusable window each
time I added 20 files. At that point, caching is largely pointless.

> And even then, since my suggested method implies only one cache lookup 
> in a much smaller cache instead of 10 lookups in a larger cache for each 
> objects it might end up faster overall even if sometimes some windows 
> don't match and deltas are recomputed needlessly.

I didn't benchmark, but I doubt it will have significant impact.
Especially on my photo test repo, the lookups are dominated by the
create_delta time by several orders of magnitude.

> Of course a greater depth might allow for a hit where there isn't any 
> otherwise.  But changing the delta depth is not something someone does 
> that often, and when the depth is changed then you better use -f with 
> git-repack as well which like I said should also ignore the cache.

That sounds reasonable to me for depth. What about other reasons for
try_delta to fail? Preferred base?

> > > So given my GIT repository such a cache would be 7610 * 40 = 304400 
> > > bytes if we stick to the full 40 bytes of sha1 to hash bad combinations.
> > Keep in mind that it will grow every time the window changes.
> What do you mean by window change?

I meant that when the window we use for a given object changes, it will
insert a new cache entry. But if we deal with invalidating unused cache
entries as you suggested before, it won't matter.

-Peff

^ permalink raw reply

* Re: [RFC] Cache negative delta pairs
From: Jeff King @ 2006-06-29 18:58 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: git
In-Reply-To: <Pine.LNX.4.64.0606291444370.1213@localhost.localdomain>

On Thu, Jun 29, 2006 at 02:48:23PM -0400, Nicolas Pitre wrote:

> Dare to test it?  I provided you with most of the code difference 
> already.

See my other mail. Unless I did something horribly wrong, caching full
windows is largely useless.

> And what do you mean by "semantically correct"?

I mean that right now the cache means "calling create_delta on content
that has sha1_a against content that has sha1_b will not produce a
useful delta." That makes sense since the content is never going to
change for a given sha1. However, covering the whole window takes into
account depth and preferred base. We just need to be sure that it is
correct to be including those in our cache calculation (I don't know,
and I'll defer to your judgement on that, since you clearly know more
about the delta logic than I do).

-Peff

^ permalink raw reply

* Re: [RFC] Cache negative delta pairs
From: Nicolas Pitre @ 2006-06-29 19:04 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git
In-Reply-To: <20060629185335.GA6704@coredump.intra.peff.net>

On Thu, 29 Jun 2006, Jeff King wrote:

> On Thu, Jun 29, 2006 at 02:24:57PM -0400, Nicolas Pitre wrote:
> 
> > > I assumed the window would change over time (though our total is still
> > > likely to hang around N*10 rather than N^2).
> > It doesn't change unless you force a different window size.
> 
> Sorry, I meant "the items in the window for a given object would change
> over time." 
> 
> > > This will fail to hit the cache anytime the window changes. How often
> > > does the window change? In my test case, I would think anytime I added a
> > > bunch of new photos, it would be likely that one of them would make it
> > > into the window, thus invalidating the cache entry and forcing me to try
> > > against every object in the window (even though I've already tried
> > > 9/10).
> > Sure.  But on the lot how often will that happen?
> 
> Reasonably often, according to my test. I did this to simulate usage
> over time:
>   - create an empty repo
>   - from my test repo of 515 images, grab 20 at a time and add/commit
>     them
>   - after each commit, record the SHA1 of (object, window[0..n]) for
>     each object to be delta'd
> If doing the cache on the sha1 of the whole window is a good idea, then
> we should see many of the same hashes from commit to commit. If we
> don't, that means the newly added files are being placed in the old
> windows, thus disrupting their hashes.
> 
> The results were that there was typically only 1 reusable window each
> time I added 20 files. At that point, caching is largely pointless.

Right.  Your use pattern is a special case that doesn't work well with 
the whole window hash approach.  I'd expect it to work beautifully with 
the kernel repository though.

> > And even then, since my suggested method implies only one cache lookup 
> > in a much smaller cache instead of 10 lookups in a larger cache for each 
> > objects it might end up faster overall even if sometimes some windows 
> > don't match and deltas are recomputed needlessly.
> 
> I didn't benchmark, but I doubt it will have significant impact.
> Especially on my photo test repo, the lookups are dominated by the
> create_delta time by several orders of magnitude.

Again I think it is a repo like the linux kernel that would benefit 
more.

> > Of course a greater depth might allow for a hit where there isn't any 
> > otherwise.  But changing the delta depth is not something someone does 
> > that often, and when the depth is changed then you better use -f with 
> > git-repack as well which like I said should also ignore the cache.
> 
> That sounds reasonable to me for depth. What about other reasons for
> try_delta to fail? Preferred base?

Hmmm.  That might need to be dealth with (easily but still).


Nicolas

^ permalink raw reply

* Re: [RFC] Cache negative delta pairs
From: Nicolas Pitre @ 2006-06-29 19:06 UTC (permalink / raw)
  To: Jeff King; +Cc: git
In-Reply-To: <20060629185759.GB6704@coredump.intra.peff.net>

On Thu, 29 Jun 2006, Jeff King wrote:

> On Thu, Jun 29, 2006 at 02:48:23PM -0400, Nicolas Pitre wrote:
> 
> > Dare to test it?  I provided you with most of the code difference 
> > already.
> 
> See my other mail. Unless I did something horribly wrong, caching full
> windows is largely useless.

... on your special photo repository I agree.

I'm still unconvinced for large repos though.


Nicolas

^ permalink raw reply

* Re: [PATCH] move get_merge_bases() to core lib; use it in merge-recursive
From: Johannes Schindelin @ 2006-06-29 19:06 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git
In-Reply-To: <7vmzbvrela.fsf@assigned-by-dhcp.cox.net>

Hi Linus, Hi Junio

[this is a response to both of your responses; mail threads cannot yet be 
merged a la git ;-)]

On Thu, 29 Jun 2006, Junio C Hamano wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
> 
> > My point being: it makes no sense to split off get_merge_bases() if nobody 
> > uses it except for git-merge-base.
> 
> I do not think that is a good reasoning.  If something is
> reusable (or you made it reusable) and you are planning to reuse
> it later, splitting it out without changing anything else to
> make sure the split is correct is a seemingly small but a very
> important step.

Okay. Convinced.

I tested my patch again, and like Alex said, a test fails. But I tested on 
top of Alex's latest merge-recursive patch, which has that nasty 
update-index bug, and that is the reason for the test to fail.

So, a few tests later, I am pretty sure that my patches do not break 
git-merge-base. I'll prepare another patch series which builds-in 
merge-base.

Ciao,
Dscho

^ permalink raw reply

* rebasing trouble
From: J. Bruce Fields @ 2006-06-29 19:47 UTC (permalink / raw)
  To: Git Mailing List

I must be missing something obvious:

bfields@puzzle:linux-2.6$ git checkout -b TMP nfs-client-stable^^^
bfields@puzzle:linux-2.6$ git-describe
v2.6.17-rc6-g28df955
bfields@puzzle:linux-2.6$ git-rebase --onto v2.6.17 origin
Nothing to do.
bfields@puzzle:linux-2.6$ git-describe
v2.6.17

So the git-rebase just reset TMP to v2.6.17.  But I know that nfs-client-stable
isn't a subset of origin, so this doesn't make sense to me.

The tree in question is actually at git://linux-nfs.org/~bfields/linux.git, if
it matters.

--b.

^ permalink raw reply

* Re: [RFC] Cache negative delta pairs
From: Jeff King @ 2006-06-29 19:52 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Junio C Hamano, git
In-Reply-To: <Pine.LNX.4.64.0606291458110.1213@localhost.localdomain>

On Thu, Jun 29, 2006 at 03:04:15PM -0400, Nicolas Pitre wrote:

> Right.  Your use pattern is a special case that doesn't work well with 
> the whole window hash approach.  I'd expect it to work beautifully with 
> the kernel repository though.

I don't necessarily care about the kernel repository. It packs fine as
it is, and you only waste 30 seconds on a repack checking deltas that
could be avoided. I do care on my special repository where packing is
virtually unusable and I can achieve a 45x speedup. Maybe my original
caching is not worth it for the kernel and should be configurable,
but obviously this window caching cannot REPLACE mine since it fails
utterly for the one thing I wanted it for.

That being said, I'm not sure that window caching is all that great for
"normal" repos.

Same test as before, but instead of simulating the commits, I merely
looked at the window hashes produced by 
  git-rev-list --objects master~$x

For the git repo:
x=0 tries 6698 windows
x=0 and x=50 contain 5197 identical windows
x=0 and x=100 contain 2484 identical windows
x=0 and x=500 contain 455 identical windows

For linux-2.6 repo:
x=0 tries 57208 windows
x=0 and x=50 contain 53677 identical windows
x=0 and x=100 contain 52886 identical windows
x=0 and x=500 contain 41196 identical windows

Obviously the kernel repo is doing better, but x=500 is only 4 days ago.
Trying with --before=2.weeks.ago yields only 31505 matches.

So the windows do clearly experience a fair bit of churn, but whether or
not this is worth it depends on how long you think is reasonable before
something gets "churned out" .

-Peff

^ permalink raw reply

* Re: rebasing trouble
From: Junio C Hamano @ 2006-06-29 20:04 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: git
In-Reply-To: <20060629194723.GD14287@fieldses.org>

"J. Bruce Fields" <bfields@fieldses.org> writes:

> I must be missing something obvious:
>
> bfields@puzzle:linux-2.6$ git checkout -b TMP nfs-client-stable^^^
> bfields@puzzle:linux-2.6$ git-describe
> v2.6.17-rc6-g28df955
> bfields@puzzle:linux-2.6$ git-rebase --onto v2.6.17 origin
> Nothing to do.
> bfields@puzzle:linux-2.6$ git-describe
> v2.6.17
>
> So the git-rebase just reset TMP to v2.6.17.  But I know that nfs-client-stable
> isn't a subset of origin, so this doesn't make sense to me.
>
> The tree in question is actually at git://linux-nfs.org/~bfields/linux.git, if
> it matters.

It matters of course.  Where is the "origin"?
 

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox