Git development
 help / color / mirror / Atom feed
* Re: [RFC] Cache negative delta pairs
From: Jeff King @ 2006-06-29 18:53 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Junio C Hamano, git
In-Reply-To: <Pine.LNX.4.64.0606291410420.1213@localhost.localdomain>

On Thu, Jun 29, 2006 at 02:24:57PM -0400, Nicolas Pitre wrote:

> > I assumed the window would change over time (though our total is still
> > likely to hang around N*10 rather than N^2).
> It doesn't change unless you force a different window size.

Sorry, I meant "the items in the window for a given object would change
over time." 

> > This will fail to hit the cache anytime the window changes. How often
> > does the window change? In my test case, I would think anytime I added a
> > bunch of new photos, it would be likely that one of them would make it
> > into the window, thus invalidating the cache entry and forcing me to try
> > against every object in the window (even though I've already tried
> > 9/10).
> Sure.  But on the lot how often will that happen?

Reasonably often, according to my test. I did this to simulate usage
over time:
  - create an empty repo
  - from my test repo of 515 images, grab 20 at a time and add/commit
    them
  - after each commit, record the SHA1 of (object, window[0..n]) for
    each object to be delta'd
If doing the cache on the sha1 of the whole window is a good idea, then
we should see many of the same hashes from commit to commit. If we
don't, that means the newly added files are being placed in the old
windows, thus disrupting their hashes.

The results were that there was typically only 1 reusable window each
time I added 20 files. At that point, caching is largely pointless.

> And even then, since my suggested method implies only one cache lookup 
> in a much smaller cache instead of 10 lookups in a larger cache for each 
> objects it might end up faster overall even if sometimes some windows 
> don't match and deltas are recomputed needlessly.

I didn't benchmark, but I doubt it will have significant impact.
Especially on my photo test repo, the lookups are dominated by the
create_delta time by several orders of magnitude.

> Of course a greater depth might allow for a hit where there isn't any 
> otherwise.  But changing the delta depth is not something someone does 
> that often, and when the depth is changed then you better use -f with 
> git-repack as well which like I said should also ignore the cache.

That sounds reasonable to me for depth. What about other reasons for
try_delta to fail? Preferred base?

> > > So given my GIT repository such a cache would be 7610 * 40 = 304400 
> > > bytes if we stick to the full 40 bytes of sha1 to hash bad combinations.
> > Keep in mind that it will grow every time the window changes.
> What do you mean by window change?

I meant that when the window we use for a given object changes, it will
insert a new cache entry. But if we deal with invalidating unused cache
entries as you suggested before, it won't matter.

-Peff

^ permalink raw reply

* Re: [RFC] Cache negative delta pairs
From: Nicolas Pitre @ 2006-06-29 18:48 UTC (permalink / raw)
  To: Jeff King; +Cc: git
In-Reply-To: <20060629180719.GB4392@coredump.intra.peff.net>

On Thu, 29 Jun 2006, Jeff King wrote:

> On Thu, Jun 29, 2006 at 12:39:31PM -0400, Nicolas Pitre wrote:
> 
> > You do that lookup for every delta match attempt.  Instead it could be 
> > done once for the whole window attempt, potentially reducing the cache 
> > size by a factor of 20, and it might be faster too.
> 
> I'm not convinced this will provide good cache hit characteristics, and
> I'm not convinced it's semantically correct (see my other mail).

Dare to test it?  I provided you with most of the code difference 
already.

And what do you mean by "semantically correct"?


Nicolas

^ permalink raw reply

* Re: CFT: merge-recursive in C (updated)
From: Johannes Schindelin @ 2006-06-29 18:40 UTC (permalink / raw)
  To: Alex Riesen; +Cc: Junio C Hamano, git, Fredrik Kuivinen, Linus Torvalds
In-Reply-To: <81b0412b0606290143g5a3a0f5atbda3f4250411e92e@mail.gmail.com>

Hi,

On Thu, 29 Jun 2006, Alex Riesen wrote:

> [cc list restored, I'm lost in the maze of git update-index, all cache
> changing functions looking almost the same]
> 
> On 6/29/06, Junio C Hamano <junkio@cox.net> wrote:
> > > this broke t6022-merge-rename.sh (the second test). It produces an
> > > index with this:
> > >
> > > .../t/trash$ git-diff-index white
> > > :100644 100644 2d603156dc5bdf6295c789cac08e3c9942a0b82a
> > 0000000000000000000000000000000000000000 M      B
> > > :100644 100644 ba41fb96393979b22691106b06bf5231eab57b85
> > 0000000000000000000000000000000000000000 M      N
> > >
> > > whereas git-merge-recursive (and the previous version, without pipe):
> > >
> > > .../t/trash$ git-diff-index white
> > > :100644 100644 2d603156dc5bdf6295c789cac08e3c9942a0b82a
> > 0000000000000000000000000000000000000000 M      B
> > >
> > > I can see that "git update-index --add" is somehow different from a
> > > pipe to "git update-index --index-info", but not very clear. Does this
> > > "zero-sha1" mean that the file "N" is not in the index?
> > 
> > When diff-index and diff-files compare a tree entry or an index
> > entry with a file in the working tree, they do not compute the
> > blob hash value for the file in the working tree.  0{40} is used
> > on the RHS in such a case.  When the working tree file matches
> > the corresponding index entry, then we know RHS matches what is
> > in the index, so both sides have the blob hash value.
> 
> Ok. Am I correct in the assumption, that if the file in working tree has
> the same SHA1 as LHS, than the next "git-update-index --refresh" will
> remove the entry from git-diff-index output?
> This is what actually happens, if I do "git-update-index --refresh", so I
> suspect that I have an SHA1 update gone missing somewhere.

I think the problem is more like this (in ce_match_stat_basic()):

        if (ce->ce_mtime.sec != htonl(st->st_mtime))
                changed |= MTIME_CHANGED;
        if (ce->ce_ctime.sec != htonl(st->st_ctime))
                changed |= CTIME_CHANGED;

If you update with --index-info, the mtime and ctime is not updated 
from the file in the working directory.

All the more a reason to go forward with direct calls to read_cache() and 
write_cache(). At the moment, my plan is

- trivially split the read_cache() code into read_cache()/read_cache_from(),
- introduce flush_cache(),
- trivially rewrite add_file_to_cache(), and add_cache_info() from
  builtin-update-index.c, move that to read-tree.c, too, and
- throw out the pipe to git-update-index from merge-recursive.c 
  altogether.

This should be less intrusive than it sounds: with the introduction of 
read_cache_from() it should be trivial to handle the problem of different 
index files.

We have to put a lock_file on the index file at the start, and write the 
index file at the end.

Ciao,
Dscho

^ permalink raw reply

* Re: [PATCH] move get_merge_bases() to core lib; use it in merge-recursive
From: Junio C Hamano @ 2006-06-29 18:39 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git
In-Reply-To: <Pine.LNX.4.63.0606291814200.29667@wbgn013.biozentrum.uni-wuerzburg.de>

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> My point being: it makes no sense to split off get_merge_bases() if nobody 
> uses it except for git-merge-base.

I do not think that is a good reasoning.  If something is
reusable (or you made it reusable) and you are planning to reuse
it later, splitting it out without changing anything else to
make sure the split is correct is a seemingly small but a very
important step.

^ permalink raw reply

* Re: [PATCH] move get_merge_bases() to core lib; use it in merge-recursive
From: Linus Torvalds @ 2006-06-29 18:26 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Alex Riesen, git, Junio C Hamano, Fredrik Kuivinen
In-Reply-To: <Pine.LNX.4.63.0606291814200.29667@wbgn013.biozentrum.uni-wuerzburg.de>



On Thu, 29 Jun 2006, Johannes Schindelin wrote:
> 
> My point being: it makes no sense to split off get_merge_bases() if nobody 
> uses it except for git-merge-base.

I disagree. Your get_merge_bases() function is another small step in 
cleanup and libification, so I think it's perfectly valid regardless of 
whether somebody uses it yet.

A lot of the libification effort ends up being preparatory, because often 
you need to get past a hump of libifying a _lot_ in order to actually use 
something in a library.

So at first, the only user is usually the old separate program. So you 
libify get_merge_bases, and initially the only user is git-merge-base, 
which could now be made a built-in, since the bulk of its code is going to 
be linked in regardless..

		Linus

^ permalink raw reply

* Re: [RFC] Cache negative delta pairs
From: Nicolas Pitre @ 2006-06-29 18:24 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git
In-Reply-To: <20060629180011.GA4392@coredump.intra.peff.net>

On Thu, 29 Jun 2006, Jeff King wrote:

> > This is way suboptimal.  First there is no reason for the cache to ever 
> > grow to N^2.  At worst it should be N*10 where 10 is the current window 
> > size.
> 
> I assumed the window would change over time (though our total is still
> likely to hang around N*10 rather than N^2).

It doesn't change unless you force a different window size.

> > none of the objects found in a given window.  Therefore we only have to 
> > compute a hash for the object names found in that window and store that 
> > in the cache.  So the cache entries would then be a pair of sha1: first 
> > the sha1 of the victim object, and the sha1 of all sha1 names for the 
> > objects against which the victim object was found not to delta well 
> > against.
> 
> This will fail to hit the cache anytime the window changes. How often
> does the window change? In my test case, I would think anytime I added a
> bunch of new photos, it would be likely that one of them would make it
> into the window, thus invalidating the cache entry and forcing me to try
> against every object in the window (even though I've already tried
> 9/10).

Sure.  But on the lot how often will that happen?
This trades a bit of CPU for much smaller cache which might be worth it.

And even then, since my suggested method implies only one cache lookup 
in a much smaller cache instead of 10 lookups in a larger cache for each 
objects it might end up faster overall even if sometimes some windows 
don't match and deltas are recomputed needlessly.

> Also, is it true to say "if this object did not delta against this
> window, it will never delta?" What about interactions with the depth
> parameter?

Of course a greater depth might allow for a hit where there isn't any 
otherwise.  But changing the delta depth is not something someone does 
that often, and when the depth is changed then you better use -f with 
git-repack as well which like I said should also ignore the cache.

> > So given my GIT repository such a cache would be 7610 * 40 = 304400 
> > bytes if we stick to the full 40 bytes of sha1 to hash bad combinations.
> 
> Keep in mind that it will grow every time the window changes.

What do you mean by window change?


Nicolas

^ permalink raw reply

* Re: [PATCH] autoconf: Use autoconf to write installation directories to config.mak
From: Junio C Hamano @ 2006-06-29 18:23 UTC (permalink / raw)
  To: Matthias Lederhofer; +Cc: git, Jakub Narebski
In-Reply-To: <E1FvvuX-0002Lr-Nt@moooo.ath.cx>

Matthias Lederhofer <matled@gmx.net> writes:

>> This is beginning of patch series introducing installation configuration
>> using autoconf (and no other autotools) to git. The idea is to generate
>> config.mak using ./configure (generated from configure.ac) from
>> config.mak.in, so one can use autoconf as an _alternative_ to ordinary
>> Makefile, and creating one's own config.mak.
>
> Are you sure this should be named config.mak? From INSTALL:
>> You can place local settings in config.mak and the Makefile
>> will include them.  Note that config.mak is not distributed;
>> the name is reserved for local settings.
>
> So with another filename either include it
> - before config.mak: the user may override ./configure options with
>   config.mak
> - after config.mak: ./configure overrides config.mak
> At least do not overwrite config.mak if it exists.

Think of it as an incredibly smart editor that you can use to
edit your config.mak.  As far as I understand it, this is an
"opt-in" way for you to manage config.mak without editing it
yourself, so I do not have much objections to the series in
principle.

I think it is fair to say that autoconf is not so bad compared
to its worse friends, but I am biased -- I used to deal with
autoconf quite a lot in old days (before there were automake nor
libtool), and my own changes to autoconf might still be
surviving in a few .m4 files but I doubt it it was so long ago.

Having said that, I am skeptical how well we can keep this
contained only to config.mak and keep it "opt-in", so I am not
enthused to have this series at the toplevel of the tree.

It would have been a bit easier to swallow if this whole
machinery to build config.mk were somewhere under contrib/ (say
in contrib/autoconf), with an instruction to make an "opt-in"
symlink "ln -s contrib/autoconf/config.mk config.mk" for people
who want to use it in the toplevel INSTALL file, perhaps.

The one that touches the Makefile to propagate some variables
down to submakes is probably a desirable change but that is
pretty much independent from the series.

^ permalink raw reply

* Re: [PATCH] Allow INSTALL, bindir, mandir to be set in main Makefile
From: Junio C Hamano @ 2006-06-29 18:23 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git
In-Reply-To: <200606291835.53788.jnareb@gmail.com>

Jakub Narebski <jnareb@gmail.com> writes:

> Part of autoconf series, but independent.

I'd like to take something like this, independently from
"optionally managing config.mak with autoconf" series.

> Should probably be split into two patches:
>  * first with export + '?='
>  * second renaming man1 and man7 to man1dir and man7dir

And I think it is probably a good idea to somehow keep people's
configurations that have been overriding man1 and man7 if
possible.  Otherwise things would regress for them.

^ permalink raw reply

* Re: Improved three-way blob merging code
From: Davide Libenzi @ 2006-06-29 18:21 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List
In-Reply-To: <Pine.LNX.4.64.0606282157210.12404@g5.osdl.org>

On Wed, 28 Jun 2006, Linus Torvalds wrote:

> It would be lovely if libxdiff did a 3-way merge on its own, but this
> basically approximates it within that three_way_filemerge() function
> using external functionality.

This is my todo-list, that unfortunately is pretty crowded nowadays. I'll 
come graveling you main window once I have it ;)



- Davide

^ permalink raw reply

* Re: xdiff: generate "anti-diffs" aka what is common to two files
From: Davide Libenzi @ 2006-06-29 18:18 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List
In-Reply-To: <Pine.LNX.4.64.0606282149060.12404@g5.osdl.org>

On Wed, 28 Jun 2006, Linus Torvalds wrote:

> Davide, I say it's a "fairly trivial patch", but quite frankly, I'm
> winging it. I'm not that comfortable with the libxdiff internal workings,
> so while this seems to work for me and make sense, can you please take a
> quick look.

Looks fine to me.



- Davide

^ permalink raw reply

* Re: [RFC] git --trace: trace command execution
From: Jakub Narebski @ 2006-06-29 18:06 UTC (permalink / raw)
  To: git
In-Reply-To: <7v3bdtv4h3.fsf@assigned-by-dhcp.cox.net>

Junio C Hamano wrote:

> By the way "git cat-file -p" or "git verify-tag -v" might be
> more pleasant to view a tag since they make the tagger timestamp
> human readable.

Interesting, -p makes tagger timestamp human readable, but not author or
commiter:

$ git cat-file -p `cat .git/refs/tags/v1.4.0`
tagger Junio C Hamano <junkio@cox.net> Sat Jun 10 12:43:37 2006 -0700

$ git cat-file -p `cat .git/refs/heads/origin`
author Johannes Schindelin <Johannes.Schindelin@gmx.de> 1151491527 +0200
committer Junio C Hamano <junkio@cox.net> 1151492136 -0700

Is it intended, or a bug/missing feature (git 1.4.0)?

-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git

^ permalink raw reply

* Re: [RFC] Cache negative delta pairs
From: Jeff King @ 2006-06-29 18:07 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: git
In-Reply-To: <Pine.LNX.4.64.0606291154510.1213@localhost.localdomain>

On Thu, Jun 29, 2006 at 12:39:31PM -0400, Nicolas Pitre wrote:

> You do that lookup for every delta match attempt.  Instead it could be 
> done once for the whole window attempt, potentially reducing the cache 
> size by a factor of 20, and it might be faster too.

I'm not convinced this will provide good cache hit characteristics, and
I'm not convinced it's semantically correct (see my other mail).

> You could simply recreate the cache on each run.  Or just keep a bitmap 

Yes, that would probably work and would be quite easy to do with the
existing code.

> First, I think it should be ignored (but still created) when 
> --no-reuse-delta is passed.  Then, it should not be created (but still 
> looked up if it exists and --no-reuse-delta is not provided) when the 
> pack index file is also not created.  I don't think it is worth making 
> this further configurable, and given the suggested strategy above the 
> cache should remain fairly small.

Those suggestions make sense to me.

-Peff

^ permalink raw reply

* Re: [RFC] Cache negative delta pairs
From: Jeff King @ 2006-06-29 18:00 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Junio C Hamano, git
In-Reply-To: <Pine.LNX.4.64.0606291053280.1213@localhost.localdomain>

On Thu, Jun 29, 2006 at 11:42:14AM -0400, Nicolas Pitre wrote:

> So the negative cache should not be O(N^2) either.  It just has to be 
> O(N*window).

Yes, unless we use the cache information to change the window (but I
don't think that's worth it, unless our window heuristics are bad).
The implementation I posted should already grow to O(N*window), since it
only saves what we try (and fail at).

> My past experiments showed that the best window size for compression is 
> a bit larger than the current default of 10.  It was rather around 15 
> for the kernel repository, with higher values than 15 not providing 
> significant improvements anymore.  But that is clearly a function of the 
> repository nature (the average number of revisions for each file).  But 
> the window size is directly connected to the computational cost.

Increasing the window, of course, has virtually no speed impact on later
repacks with the cache in effect. Packing with a window of 15 drops my
linux-2.6 pack to 108M from 122M. That helps offset the cache size
(though of course it takes longer on the initial pack).

> This is way suboptimal.  First there is no reason for the cache to ever 
> grow to N^2.  At worst it should be N*10 where 10 is the current window 
> size.

I assumed the window would change over time (though our total is still
likely to hang around N*10 rather than N^2).

> none of the objects found in a given window.  Therefore we only have to 
> compute a hash for the object names found in that window and store that 
> in the cache.  So the cache entries would then be a pair of sha1: first 
> the sha1 of the victim object, and the sha1 of all sha1 names for the 
> objects against which the victim object was found not to delta well 
> against.

This will fail to hit the cache anytime the window changes. How often
does the window change? In my test case, I would think anytime I added a
bunch of new photos, it would be likely that one of them would make it
into the window, thus invalidating the cache entry and forcing me to try
against every object in the window (even though I've already tried
9/10).

Also, is it true to say "if this object did not delta against this
window, it will never delta?" What about interactions with the depth
parameter?

> So given my GIT repository such a cache would be 7610 * 40 = 304400 
> bytes if we stick to the full 40 bytes of sha1 to hash bad combinations.

Keep in mind that it will grow every time the window changes.

-Peff

^ permalink raw reply

* [PATCH] autoconf: Set mandir in config.mak.in and export variables not in Makefile
From: Jakub Narebski @ 2006-06-29 17:47 UTC (permalink / raw)
  To: git
In-Reply-To: <200606291835.53788.jnareb@gmail.com>

This patch with the previous "[PATCH] Allow INSTALL, bindir, mandir to
be set in main Makefile" patch allows ./configure script to set where
manpages will be installed using --mandir=DIR (./configure defaults to
PREFIX/man).

Signed-off-by: Jakub Narebski <jnareb@gmail.com>
---

Uwe Zeisberger wrote:
> autoconf does to much things, even with that little configure.ac.
> (But I agree, it's much better than automake :-)
>
> E.g.
>
>         ./configure --prefix=$HOME/usr --mandir=$HOME/usr/share/man
>
> is supported by the configure script, but the manpages are installed
> in $HOME/usr/man all the same.
>
> BTW: Even if I specify mandir=... in config.mak, it is not respected,
> because only the toplevel Makefile includes config.mak.  (I didn't
> test it, but I think I could export mandir in config.mak ...)

This patch and previous one adressess this.

 config.mak.in |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/config.mak.in b/config.mak.in
index 82d80e2..82c9781 100644
--- a/config.mak.in
+++ b/config.mak.in
@@ -8,5 +8,11 @@ #gitexecdir = @libexecdir@/git-core/
 template_dir = @datadir@/git-core/templates/
 GIT_PYTHON_DIR = @datadir@/git-core/python
 
+mandir=@mandir@
+
 srcdir = @srcdir@
 VPATH = @srcdir@
+
+export exec_prefix mandir
+export srcdir VPATH
+
-- 
1.4.0

^ permalink raw reply related

* Re: Improved three-way blob merging code
From: Linus Torvalds @ 2006-06-29 17:45 UTC (permalink / raw)
  To: Alex Riesen; +Cc: Junio C Hamano, Git Mailing List, Davide Libenzi
In-Reply-To: <81b0412b0606290043s15e19b9fl853627e815f009ff@mail.gmail.com>



On Thu, 29 Jun 2006, Alex Riesen wrote:
>
> On 6/29/06, Linus Torvalds <torvalds@osdl.org> wrote:
> > +static void *three_way_filemerge(mmfile_t *base, mmfile_t *our, mmfile_t
> > *their, unsigned long *size)
> > +{
> ...
> > +       if (t1 && t2 && t3) {
> > +               int code = run_command("merge", t2, t1, t3, NULL);
> 
> This does not use the labels of merge(1) and the merged file will contain
> the names of temporary files at conflict places, which is very confusing if
> you happen to loose context while doing a merge with lots of conflicts.

Yes. 

I was really really _really_ just hoping that I could do the nasty core 
code that others might feel uncomfortable with, and get it working from a 
_technical_ level well enough that others (hint hint) would decide that 
they can fix up the details.

Getting the first version working is often the hardest. When you reach the 
point of "it works, but I want to extend it to do X", you've usually 
already gotten pretty far.

Anyway, all of this was really just preparatory work to show what 
git-merge-tree does. A few more improvements to git-merge-tree, and 
hopefully it can start being useful - perhaps not initially for actually 
merging, but for doing a tree-level three-way diff between two branches.

In other words, my current goal is really do make it possible to get good 
diffs out of git from two branches that aren't directly related. You and 
Dscho seem to be doing well on the git-merge-recursive front, so my 
personal goal is actually to be able to get a saner diff than what

	git diff mine..theirs

gives you.

The above "git diff" is a perfectly fine thing to do, but it's usually not 
what you actually _want_. Almost always, what you want a diff between two 
branches, what you want is actually the diff after a merge of the 
branches, not the raw _current_ differences.

For example, look at our current 

	Documentation/howto/using-topic-branches.txt

file, and realize that the current scripts it suggests are actually 
broken:

	To create diffstat and shortlog summaries of changes to include in 
	a "pleasepull" request to Linus you can use:

	 $ git-whatchanged -p release ^linus | diffstat -p1
	and
	 $ git-whatchanged release ^linus | git-shortlog

where that "git-whatchanged -p release ^linus | diffstat -p1" won't 
actually be what I see when I merge (although it will hopefully be close 
enough). Also, there's no indication that the merge will fail when I pull, 
something that _would_ be very useful.

IOW, what I'd like git-merge-tree to do is to be able to at a minimum say:

 - will a merge succeed cleanly, and if not, show me where the problem 
   spots are.
 - what will the result of the merge look like.

because that's actually what a downstream developer wants to do. He'd just 
do

	git fetch linus
	git show-changes linus..my-branch

which would basically be the preparatory thing for sending me an email 
saying "please merge 'my-branch', and you'll see this".

Now, obviously, I think that there's a _lot_ of overlap between doing this 
and actually doing the merge itself, so hopefully the things I do will at 
least have some things in common and perhaps help you do the proper 
recursive merger.

But one thing I was actually hoping to do was to literally be able to do 
this without either tree being checked out, exactly so that you can check 
the status of a branch that you may not even _be_ on (eg that "my-branch" 
in the example above may not even be your current HEAD: you might be on 
your development branch when you actually ask me to pull from your stable 
branch).

So the "show differences" has a lot in common with "merge them", but there 
are literally a few stages missing. One thing missing in just showing 
differences is that you can't actually fix up the merge clashes: if you 
don't have a checked-out tree, you're not going to be able to do a real 
merge. But you can see what a merge would _look_ like, and whether there 
are any clashes that you will have to fix..

		Linus

^ permalink raw reply

* Re: [RFC] Cache negative delta pairs
From: Nicolas Pitre @ 2006-06-29 16:39 UTC (permalink / raw)
  To: Jeff King; +Cc: git
In-Reply-To: <20060629035849.GA30749@coredump.intra.peff.net>

On Wed, 28 Jun 2006, Jeff King wrote:

> >From repack to repack, we end up trying to delta many of the same object
> pairs, which is computationally expensive.  This patch makes a
> persistent cache, $GIT_DIR/delta-cache, which contains pairs of sha1
> hashes (the presence of a pair indicates that it's not worth trying to
> delta those two objects).

I think this could be done differently to be even more space efficient 
as I suggested earlier.

> Here are some of my thoughts:
> 
>  - speed. The implementation is quite fast. The sha1 pairs are stored
>    sorted, and we mmap and binary search them. Certainly the extra time
>    spent in lookup is justified by avoiding the delta attempts.

You do that lookup for every delta match attempt.  Instead it could be 
done once for the whole window attempt, potentially reducing the cache 
size by a factor of 20, and it might be faster too.

>  - size. The cache is a packed sequence of binary sha1 pairs. I was
>    concerned that it would grow too large (obviously for n blobs you can
>    end up with n^2/2 entries), but it doesn't seem unreasonable for most
>    repos (either you don't have a lot of files, or if you do, they delta
>    reasonably well). My test repo's cache is only 144K. The git cache is
>    about 2.7M. The linux-2.6 cache is 22M.

See my previous email for comments about this.

>    Theoretically, I could bound the cache size and boot old entries.
>    However, that means storing age information, which increases the
>    required size. I think keeping it simple is best.

You could simply recreate the cache on each run.  Or just keep a bitmap 
of referenced cache entries, and a list of new entries.  At the end if 
the new entry list is empty and the bitmap is all set (or clear) then 
you just keep the current cache.  ONce, say, more than 10% of cache 
entries are not used anymore then you regenerate the cache.  But since 
you need to regenerate it when new entries are added then the 10% unused 
entry criteria might not be used that often anyway, unless packs shrink 
with time which might not be the case really often.

>  - correctness. Currently the code uses the output of try_delta for
>    negative caching. Should the cache checking be moved inside try_delta
>    instead? This would give more control over which reasons to mark a
>    delta negative (I believe right now hitting the depth limit will
>    cause a negative mark; we should perhaps only do so if the content
>    itself makes the delta unpalatable).

Like I said earlier I think this should be moved up a bit i.e. outside 
the delta loop.  In find_deltas() right before the ...

		j = window;
		while (--j > 0) {

loop, just insert another loop that compute the sha1 of the current 
target object and window:

		SHA1_Init(&c);
		j = window;
		while (j-- > 0) { /* postdec include the current object as well */
			unsigned int other_idx = idx + j;
			struct unpacked *m;
			if (other_idx >= window)
				other_idx -= window;
			m = array + other_idx;
			if (!m->entry)
				break;
			SHA1_Update(&c, m->entry.sha1, 20);
		}
		SHA1_Final(negative_delta_hash, &c);			

Then you can skip over the whole of the delta loop right away if that 
negative_delta_hash matches one entry in the cache since you already 
know that this object with this window doesn't produce any delta result.

Otherwise, after the delta loop has proceeded, if there is no delta 
found then you can add negative_delta_hash to the cache.

>  - optionalness. Currently the delta-cache is always used. Since it is a
>    space-time tradeoff, maybe it should be optional (it will have
>    negligible performance and horrible size impact on a repo that
>    consists of many very small but unrelated objects).  Possible methods
>    include:
>      - enable cache saves only if .git/delta-cache is present; turn it
>        on initially with 'touch .git/delta-cache'
>      - config variable

First, I think it should be ignored (but still created) when 
--no-reuse-delta is passed.  Then, it should not be created (but still 
looked up if it exists and --no-reuse-delta is not provided) when the 
pack index file is also not created.  I don't think it is worth making 
this further configurable, and given the suggested strategy above the 
cache should remain fairly small.


Nicolas

^ permalink raw reply

* [PATCH] Allow INSTALL, bindir, mandir to be set in main Makefile
From: Jakub Narebski @ 2006-06-29 16:35 UTC (permalink / raw)
  To: git
In-Reply-To: <200606291704.27677.jnareb@gmail.com>

Makefiles in subdirectories now use existing value of INSTALL, bindir,
mandir if it is set, allowing those to be set in main Makefile or in
config.mak.  Main Makefile exports variables which it sets.

Renames man1 and man7 variables to man1dir and man7dir, according to
"Makefile Conventions: Variables for Installation Directories" in
make.info of GNU Make.  Renames bin to bindir (unused, perhaps to be
removed).

Signed-off-by: Jakub Narebski <jnareb@gmail.com>
---
Part of autoconf series, but independent.

Should probably be split into two patches:
 * first with export + '?='
 * second renaming man1 and man7 to man1dir and man7dir

 Documentation/Makefile   |   14 +++++++-------
 Makefile                 |    2 ++
 contrib/emacs/Makefile   |    4 ++--
 contrib/git-svn/Makefile |    8 ++++----
 4 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/Documentation/Makefile b/Documentation/Makefile
index 2b0efe7..cc83610 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -25,10 +25,10 @@ DOC_MAN1=$(patsubst %.txt,%.1,$(MAN1_TXT
 DOC_MAN7=$(patsubst %.txt,%.7,$(MAN7_TXT))
 
 prefix?=$(HOME)
-bin=$(prefix)/bin
-mandir=$(prefix)/man
-man1=$(mandir)/man1
-man7=$(mandir)/man7
+bindir?=$(prefix)/bin
+mandir?=$(prefix)/man
+man1dir=$(mandir)/man1
+man7dir=$(mandir)/man7
 # DESTDIR=
 
 INSTALL?=install
@@ -52,9 +52,9 @@ man1: $(DOC_MAN1)
 man7: $(DOC_MAN7)
 
 install: man
-	$(INSTALL) -d -m755 $(DESTDIR)$(man1) $(DESTDIR)$(man7)
-	$(INSTALL) $(DOC_MAN1) $(DESTDIR)$(man1)
-	$(INSTALL) $(DOC_MAN7) $(DESTDIR)$(man7)
+	$(INSTALL) -d -m755 $(DESTDIR)$(man1dir) $(DESTDIR)$(man7dir)
+	$(INSTALL) $(DOC_MAN1) $(DESTDIR)$(man1dir)
+	$(INSTALL) $(DOC_MAN7) $(DESTDIR)$(man7dir)
 
 
 #
diff --git a/Makefile b/Makefile
index cde619c..b8fe669 100644
--- a/Makefile
+++ b/Makefile
@@ -100,6 +100,8 @@ template_dir = $(prefix)/share/git-core/
 GIT_PYTHON_DIR = $(prefix)/share/git-core/python
 # DESTDIR=
 
+export prefix bindir gitexecdir template_dir GIT_PYTHON_DIR
+
 CC = gcc
 AR = ar
 TAR = tar
diff --git a/contrib/emacs/Makefile b/contrib/emacs/Makefile
index d3619db..350846d 100644
--- a/contrib/emacs/Makefile
+++ b/contrib/emacs/Makefile
@@ -3,9 +3,9 @@ ## Build and install stuff
 EMACS = emacs
 
 ELC = git.elc vc-git.elc
-INSTALL = install
+INSTALL ?= install
 INSTALL_ELC = $(INSTALL) -m 644
-prefix = $(HOME)
+prefix ?= $(HOME)
 emacsdir = $(prefix)/share/emacs/site-lisp
 
 all: $(ELC)
diff --git a/contrib/git-svn/Makefile b/contrib/git-svn/Makefile
index 7c20946..8cac688 100644
--- a/contrib/git-svn/Makefile
+++ b/contrib/git-svn/Makefile
@@ -1,9 +1,9 @@
 all: git-svn
 
 prefix?=$(HOME)
-bindir=$(prefix)/bin
-mandir=$(prefix)/man
-man1=$(mandir)/man1
+bindir?=$(prefix)/bin
+mandir?=$(prefix)/man
+man1dir=$(mandir)/man1
 INSTALL?=install
 doc_conf=../../Documentation/asciidoc.conf
 -include ../../config.mak
@@ -17,7 +17,7 @@ install: all
 	$(INSTALL) git-svn $(DESTDIR)$(bindir)
 
 install-doc: doc
-	$(INSTALL) git-svn.1 $(DESTDIR)$(man1)
+	$(INSTALL) git-svn.1 $(DESTDIR)$(man1dir)
 
 doc: git-svn.1
 git-svn.1 : git-svn.xml
-- 
1.4.0

^ permalink raw reply related

* Re: [RFC] Cache negative delta pairs
From: Nicolas Pitre @ 2006-06-29 16:35 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git
In-Reply-To: <Pine.LNX.4.64.0606291053280.1213@localhost.localdomain>

On Thu, 29 Jun 2006, Nicolas Pitre wrote:

> And this can be pushed even further by just including the sha1 of the 
> victim object inside the list of objects therefore computing a hash of 
> all objects (the victim and the window) for which no delta results. The 
> cache is therefore a list of hash values corresponding to bad 
> victim+window combinations.
> 
> So given my GIT repository such a cache would be 7610 * 40 = 304400 
> bytes if we stick to the full 40 bytes of sha1 to hash bad combinations.

Correction: the 40 bytes figure is for _ascii_ representation of sha1 
values.  The cache doesn't need ascii and therefore this number can be 
reduced by half.


Nicolas

^ permalink raw reply

* Re: [PATCH] move get_merge_bases() to core lib; use it in merge-recursive
From: Johannes Schindelin @ 2006-06-29 16:14 UTC (permalink / raw)
  To: Alex Riesen; +Cc: git, Junio C Hamano, Fredrik Kuivinen, Linus Torvalds
In-Reply-To: <81b0412b0606290714v66a32976j531e2077ce6c1d77@mail.gmail.com>

Hi,

On Thu, 29 Jun 2006, Alex Riesen wrote:

> On 6/29/06, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> > 
> > most of this patch is just a "sub-file rename", i.e. moving code
> > literally (sue me, SCO!) from merge-base.c to commit.c.
> > 
> 
> BTW, you probably want to post merge-recursive.c changes separately.

My point being: it makes no sense to split off get_merge_bases() if nobody 
uses it except for git-merge-base.

Ciao,
Dscho

^ permalink raw reply

* Re: [PATCH] move get_merge_bases() to core lib; use it in merge-recursive
From: Johannes Schindelin @ 2006-06-29 16:14 UTC (permalink / raw)
  To: Alex Riesen; +Cc: git, Junio C Hamano, Fredrik Kuivinen, Linus Torvalds
In-Reply-To: <81b0412b0606290712h4960ee8et7ea219d4dd6428b4@mail.gmail.com>

Hi,

On Thu, 29 Jun 2006, Alex Riesen wrote:

> On 6/29/06, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> > Aargh! Of course this is [PATCH 2/2]. BTW, no signoff, since the whole
> > merge-recursive is not mergeable yet (passes the tests, but we have a
> > small way to go).
> 
> How did you get it to pass the tests? Maybe you still have git-merge-recursive
> as default merge strategy?

Darn. Yes.

Sorry for the noise.

Ciao,
Dscho

^ permalink raw reply

* Re: [RFC] Cache negative delta pairs
From: Nicolas Pitre @ 2006-06-29 15:42 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git
In-Reply-To: <7v4py4y7wo.fsf@assigned-by-dhcp.cox.net>

On Wed, 28 Jun 2006, Junio C Hamano wrote:

> Interesting idea.  I think this matters more because for a
> repository with many unrelated undeltifiable files, we do the
> computation for objects that results in _no_ delta.  For normal
> nearly fully packed repositories, once an object is deltified
> against something else, subsequent repacking of the same set of
> objects (or a superset thereof) will very likely reuse the delta
> without recomputation, so as long as each object _can_ be
> deltified with at least _one_ other object, you should not see
> improvement on them.
> 
> So I am curious where the speed-up comes from for "normal" repos
> in your experiments.

My GIT repo currently has 23622 objects and 7610 of them are currently 
undeltified.  Those objects are of course candidates for delta matching 
each time git-repack is run.

> If it turns out that in "normal" repos the
> objects that hit your negative cache are stored undeltified,
> then that suggests that it might be worthwhile to consider using
> a cache of "inherently undeltifiable objects", In other words, a
> negative cache of O(N) entries, instead of O(N^2) entries,

Actually... I'm not so sure.  Those objects are not "inherently 
undeltifiable".  They just happen to not have other objects to easily 
delta against in the given set of objects.  Think of a file with only 
one revision for example.  As soon as there is a second revision of that 
file added to the set of objects then the former revision would have a 
high probability of being deltifiable.

So the negative cache should not be O(N^2) either.  It just has to be 
O(N*window).

> Another interpretation of your result is that we may be using a
> delta window that is unnecessarily too deep, and your negative
> cache is collecting less optimum candidates that we attempt to
> deltify against "just in case".  Can you easily instrument your
> code to see where in the sorted delta candidate list the pairs
> that hit your the negative cache are?  That is, in find_deltas()
> function, we have "while (--j > 0)" loop that attempts to delta
> with the entry that is j (modulo window size) entries away from
> the current one, then j-1, j-2, ...; I am interested in the
> distribution of "j" value for the pair "n,m" that hits your
> negative cache for normal repositories, and I am speculating
> that the value would probably be small relative to the delta
> window size.

My past experiments showed that the best window size for compression is 
a bit larger than the current default of 10.  It was rather around 15 
for the kernel repository, with higher values than 15 not providing 
significant improvements anymore.  But that is clearly a function of the 
repository nature (the average number of revisions for each file).  But 
the window size is directly connected to the computational cost.

> Another idea is to have a cache of "paths at which inherently
> undeltifiable objects live in".  For example, we currently do
> not delta OpenOffice documents (*.odt, *.odp, etc) very well.
> If one has a repository that tracks the history of "file.odp",
> we know each revision of "file.odp" would not delta against any
> other version anyway, and could skip attempting to deltify them.

I'm afraid this could lead to bad behavior eventually.  Better to just 
attempt a delta once, and when an object has not found any delta 
base candidate then just write its sha1 and corresponding window to the 
cache.  This would imply an initial cost to create the cache the first 
time, but after that the created cache could be relied upon as hard 
information and not just as guess heuristics.

> >  - size. The cache is a packed sequence of binary sha1 pairs. I was
> >    concerned that it would grow too large (obviously for n blobs you can
> >    end up with n^2/2 entries), but it doesn't seem unreasonable for most
> >    repos (either you don't have a lot of files, or if you do, they delta
> >    reasonably well). My test repo's cache is only 144K. The git cache is
> >    about 2.7M. The linux-2.6 cache is 22M.

This is way suboptimal.  First there is no reason for the cache to ever 
grow to N^2.  At worst it should be N*10 where 10 is the current window 
size.

Next I think this can be made just N*2.  Consider that the criteria for 
skipping over delta matching for a given object is the fact 
that we already know that such object doesn't delta against 
none of the objects found in a given window.  Therefore we only have to 
compute a hash for the object names found in that window and store that 
in the cache.  So the cache entries would then be a pair of sha1: first 
the sha1 of the victim object, and the sha1 of all sha1 names for the 
objects against which the victim object was found not to delta well 
against.

And this can be pushed even further by just including the sha1 of the 
victim object inside the list of objects therefore computing a hash of 
all objects (the victim and the window) for which no delta results. The 
cache is therefore a list of hash values corresponding to bad 
victim+window combinations.

So given my GIT repository such a cache would be 7610 * 40 = 304400 
bytes if we stick to the full 40 bytes of sha1 to hash bad combinations.


Nicolas

^ permalink raw reply

* [PATCH] autoconf: Cleanup generation of config.mak.append by ./configure
From: Jakub Narebski @ 2006-06-29 15:04 UTC (permalink / raw)
  To: git
In-Reply-To: <200606291536.18667.jnareb@gmail.com>

Signed-off-by: Jakub Narebski <jnareb@gmail.com>
---
 configure.ac |    7 ++++---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/configure.ac b/configure.ac
index fbd46e2..f48307c 100644
--- a/configure.ac
+++ b/configure.ac
@@ -6,12 +6,14 @@ AC_INIT([git], [1.4.0], [git@vger.kernel
 
 AC_CONFIG_SRCDIR([git.c])
 
+echo "# config.mak.append.  Generated by configure." >> config.mak.append
+
 # Definitions of macros
 # MY_APPEND_LINE(LINE)
 # --------------------
 # Append LINE to file config.mak.append
 AC_DEFUN([MY_APPEND_LINE],
-[[echo "$1" >> config.mak.append]])# AC_APPEND_LINE
+[[echo "$1" >> config.mak.append]])# MY_APPEND_LINE
 
 
 # Checks for libraries.
@@ -45,6 +47,5 @@ AC_CHECK_FUNC(setenv,,MY_APPEND_LINE(NO_
 
 # Output files
 AC_CONFIG_FILES([config.mak:config.mak.in:config.mak.append], 
-[rm -f config.mak.append], 
-[echo >> config.mak.append])
+[rm -f config.mak.append])
 AC_OUTPUT
-- 
1.4.0

^ permalink raw reply related

* Re: [PATCH] move get_merge_bases() to core lib; use it in merge-recursive
From: Alex Riesen @ 2006-06-29 14:14 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, Junio C Hamano, Fredrik Kuivinen, Linus Torvalds
In-Reply-To: <Pine.LNX.4.63.0606291517010.29667@wbgn013.biozentrum.uni-wuerzburg.de>

On 6/29/06, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
>
> most of this patch is just a "sub-file rename", i.e. moving code
> literally (sue me, SCO!) from merge-base.c to commit.c.
>

BTW, you probably want to post merge-recursive.c changes separately.

^ permalink raw reply

* Re: [PATCH] move get_merge_bases() to core lib; use it in merge-recursive
From: Alex Riesen @ 2006-06-29 14:13 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, Junio C Hamano, Fredrik Kuivinen, Linus Torvalds
In-Reply-To: <Pine.LNX.4.63.0606291517010.29667@wbgn013.biozentrum.uni-wuerzburg.de>

On 6/29/06, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
>
> most of this patch is just a "sub-file rename", i.e. moving code
> literally (sue me, SCO!) from merge-base.c to commit.c.
>

Aah, thanks! Will try it later today.

^ permalink raw reply

* Re: [PATCH] move get_merge_bases() to core lib; use it in merge-recursive
From: Alex Riesen @ 2006-06-29 14:12 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, Junio C Hamano, Fredrik Kuivinen, Linus Torvalds
In-Reply-To: <Pine.LNX.4.63.0606291519440.29667@wbgn013.biozentrum.uni-wuerzburg.de>

On 6/29/06, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> Aargh! Of course this is [PATCH 2/2]. BTW, no signoff, since the whole
> merge-recursive is not mergeable yet (passes the tests, but we have a
> small way to go).

How did you get it to pass the tests? Maybe you still have git-merge-recursive
as default merge strategy?

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox