Git development
* Re: Removal of "--merge-order"?
From: Linus Torvalds @ 2006-02-24 18:07 UTC (permalink / raw)
  To: Randy.Dunlap; +Cc: Junio C Hamano, Git Mailing List
In-Reply-To: <Pine.LNX.4.58.0602240942520.7894@shark.he.net>



On Fri, 24 Feb 2006, Randy.Dunlap wrote:
>
> Other than Ryan's reply, I found 2 users in a quick search,
> but they have already stated that they are willing to change, so I
> don't see objections unless someone else comes forward.

One thing we could do - and might be simpler - is to make the merge-order 
thing be a post-processing phase of git-rev-list.

IOW, instead of

	git-rev-list --merge-order

we could perhaps do

	git-rev-list --parents [--topo-order?] | git-merge-order

so that the merge-order code wouldn't impact git-rev-list itself.

As it is, the merge-order code ends up hooking into the "process_commit" 
thing (and thus to "filter_commit" which does the parent rewriting, and 
then show_commit), which makes it harder to work with.

Now, rev-list.c is not the biggest file (apply.c is about twice the size), 
but in many ways it's the most complex one by far. It's also the most 
performance-critical one, and the one that it would be really nice if we 
were to be able to libify it.

For example, instead of the horrid scripting language, I _think_ I could 
almost libify it by just hooking into "show_commit", and using a callback 
function for that (and then the stand-alone program would just make the 
callback function be one that prints out the commit). 
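
A minimal sketch of the callback idea, written in Python rather than git's C and not taken from rev-list.c: the walker knows nothing about output and simply hands each commit to a show_commit callback. The toy history, walk_commits and print_commit are hypothetical names used only for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical toy history: commit id -> parent ids (not real git data).
HISTORY: Dict[str, List[str]] = {
    "c3": ["c2"],
    "c2": ["c1"],
    "c1": [],
}


def walk_commits(
    history: Dict[str, List[str]],
    start: str,
    show_commit: Callable[[str, List[str]], None],
) -> None:
    """Visit each commit reachable from start once and hand it to show_commit."""
    seen = set()
    stack = [start]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        parents = history[commit]
        show_commit(commit, parents)
        # Push parents in reverse so the first parent is visited first.
        stack.extend(reversed(parents))


def print_commit(commit: str, parents: List[str]) -> None:
    """Callback for the stand-alone tool: print the commit and its parents."""
    print(commit, *parents)


# Stand-alone behaviour: print every commit.
walk_commits(HISTORY, "c3", print_commit)

# Library behaviour: another caller collects commits instead of printing them.
collected: List[str] = []
walk_commits(HISTORY, "c3", lambda commit, parents: collected.append(commit))
```

The only difference between the "program" and the "library" use is which callback is passed in, which is the property the paragraph above is after.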

With some care, we might be able to make things like "git diff" be small C 
programs (or, more likely, to save space and not replicate the binaries 
many times - make the "git" binary able to do all the simple things on its 
own: "git-diff" would be just a link to "git").

That would possibly be a simpler way to get away from using nonportable 
scripts. Plain C really does remain one of the most portable things out 
there.

			Linus


* Re: Removal of "--merge-order"?
From: Randy.Dunlap @ 2006-02-24 18:10 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Randy.Dunlap, Junio C Hamano, Git Mailing List
In-Reply-To: <Pine.LNX.4.64.0602240957430.22647@g5.osdl.org>

On Fri, 24 Feb 2006, Linus Torvalds wrote:

>
>
> On Fri, 24 Feb 2006, Randy.Dunlap wrote:
> >
> > Other than Ryan's reply, I found 2 users in a quick search,
> > but they have already stated that they are willing to change, so I
> > don't see objections unless someone else comes forward.
>
> One thing we could do - and might be simpler - is to make the merge-order
> thing be a post-processing phase of git-rev-list.
>
> IOW, instead of
>
> 	git-rev-list --merge-order
>
> we could perhaps do
>
> 	git-rev-list --parents [--topo-order?] | git-merge-order
>
> so that the merge-order code wouldn't impact git-rev-list itself.

Makes sense to me... thanks.
But even that may not be needed if no one else really needs it.

> As it is, the merge-order code ends up hooking into the "process_commit"
> thing (and thus to "filter_commit" which does the parent rewriting, and
> then show_commit), which makes it harder to work with.
>
> Now, rev-list.c is not the biggest file (apply.c is about twice the size),
> but in many ways it's the most complex one by far. It's also the most
> performance-critical one, and the one that it would be really nice if we
> were to be able to libify it.
>
> For example, instead of the horrid scripting language, I _think_ I could
> almost libify it by just hooking into "show_commit", and using a callback
> function for that (and then the stand-alone program would just make the
> callback function be one that prints out the commit).
>
> With some care, we might be able to make things like "git diff" be small C
> programs (or, more likely, to save space and not replicate the binaries
> many times - make the "git" binary able to do all the simple things on its
> own: "git-diff" would be just a link to "git").
>
> That would possibly be a simpler way to get away from using nonportable
> scripts. Plain C really does remain one of the most portable things out
> there.

-- 
~Randy


* Re: [PATCH] diff-delta: produce optimal pack data
From: Carl Baldwin @ 2006-02-24 18:35 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Junio C Hamano, git
In-Reply-To: <Pine.LNX.4.64.0602241252300.31162@localhost.localdomain>

On Fri, Feb 24, 2006 at 12:56:04PM -0500, Nicolas Pitre wrote:
My version is 1.2.1.  I have not been following your work.  When was
pack data reuse introduced?  From where can I obtain your delta patches?

There is really no opportunity for pack-data reuse in this case.  The
repository had never been packed or cloned in the first place.  As I
said, I do not intend to pack these binary files at all since there is
no benefit in this case.

The delta patches may help but I can't say for sure since I don't know
anything about them.  Let me know where I can get them.

Carl

> 
> I must ask if you had applied my latest delta patches?
> 
> Also did you use a recent version of git that implements pack data 
> reuse?
> 
> 
> Nicolas
> -
> To unsubscribe from this list: send the line "unsubscribe git" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

-- 
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 Carl Baldwin                        RADCAD (R&D CAD)
 Hewlett Packard Company
 MS 88                               work: 970 898-1523
 3404 E. Harmony Rd.                 work: Carl.N.Baldwin@hp.com
 Fort Collins, CO 80525              home: Carl@ecBaldwin.net
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -


* Re: [PATCH] diff-delta: produce optimal pack data
From: Carl Baldwin @ 2006-02-24 18:49 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Junio C Hamano, git
In-Reply-To: <Pine.LNX.4.64.0602241252300.31162@localhost.localdomain>

I've updated to a very current master branch.  This seems to include the
pack data reuse stuff.  I've not made an attempt yet to apply your delta
patches.

git-repack quickly gets up to 5% (2/36) and hangs there.  I'll let it
run for a while just to see how far it claims to get.  I'm not hopeful.

Maybe your patches can help?

Carl

On Fri, Feb 24, 2006 at 12:56:04PM -0500, Nicolas Pitre wrote:
> On Fri, 24 Feb 2006, Carl Baldwin wrote:
> 
> > Junio,
> > 
> > This message came to me at exactly the right time.  Yesterday I was
> > exploring using git as the content storage back-end for some binary
> > files.  Up until now I've only used it for software projects.
> > 
> > I found the largest RCS file that we had in our current back-end.  It
> > contained twelve versions of a binary file.  Each version averaged about
> > 20 MB.  The ,v file from RCS was about 250MB.  I did some experiments on
> > these binary files.
> > 
> > First, gzip consistently is able to compress these files to about 10%
> > their original size.  So, they are quite inflated.  Second, xdelta would
> > produce a delta between two neighboring revisions of about 2.5MB in size
> > that would compress down to about 2MB.  (about the same size as the next
> > revision compressed without deltification so packing is ineffective
> > here).
> > 
> > I added these 12 revisions to several version control back-ends
> > including subversion and git.  Git produced a much smaller repository
> > size than the others simply due to the compression that it applies to
> > objects.  It also was at least as fast as the others.
> > 
> > The problem came when I tried to clone this repository.
> > git-pack-objects chewed on these 12 revisions for over an hour before I
> > finally interrupted it.  As far as I could tell, it hadn't made much
> > progress.
> 
> I must ask if you had applied my latest delta patches?
> 
> Also did you use a recent version of git that implements pack data 
> reuse?
> 
> 
> Nicolas



* Re: [PATCH] diff-delta: produce optimal pack data
From: Nicolas Pitre @ 2006-02-24 18:57 UTC (permalink / raw)
  To: Carl Baldwin; +Cc: Junio C Hamano, git
In-Reply-To: <20060224183554.GA31247@hpsvcnb.fc.hp.com>

On Fri, 24 Feb 2006, Carl Baldwin wrote:

> My version is 1.2.1.  I have not been following your work.  When was
> pack data reuse introduced?

Try out version 1.2.3.

> From where can I obtain your delta patches?

Forget them for now -- they won't help you.

> There is really no opportunity for pack-data reuse in this case.  The
> repository had never been packed or cloned in the first place.  As I
> said, I do not intend to pack these binary files at all since there is
> no benefit in this case.

Yes there is, as long as you have version 1.2.3.  The clone logic will 
simply reuse already packed data without attempting to recompute it.

> The delta patches may help but I can't say for sure since I don't know
> anything about them.

They (actually the last one) might help reduce the size of resulting 
packs but it currently has performance problems with some pathological 
data sets.

I think you really should try git version 1.2.3 with a packed 
repository.  It might handle your special case just fine.


Nicolas


* Re: [PATCH] diff-delta: produce optimal pack data
From: Nicolas Pitre @ 2006-02-24 19:03 UTC (permalink / raw)
  To: Carl Baldwin; +Cc: Junio C Hamano, git
In-Reply-To: <20060224184934.GA387@hpsvcnb.fc.hp.com>

On Fri, 24 Feb 2006, Carl Baldwin wrote:

> I've updated to a very current master branch.  This seems to include the
> pack data reuse stuff.  I've not made an attempt yet to apply your delta
> patches.
> 
> git-repack quickly gets up to 5% (2/36) and hangs there.  I'll let it
> run for a while just to see how far it claims to get.  I'm not hopeful.

It should complete eventually, probably after the same amount of time 
needed by your previous clone attempt.  But after that any clone 
operation should be quick.  This is clearly unacceptable but at least 
with the pack data reuse you should suffer only once for the initial 
repack.

> Maybe your patches can help?

No.  They actually make things worse performance wise, much worse in 
some special cases.

Is it possible for me to have access to 2 consecutive versions of your 
big binary file?


Nicolas


* Re: git-annotate efficiency
From: Randal L. Schwartz @ 2006-02-24 19:06 UTC (permalink / raw)
  To: Morten Welinder; +Cc: GIT Mailing List
In-Reply-To: <118833cc0602241000p4e4c8017u3e3afe76fbbd75a4@mail.gmail.com>

>>>>> "Morten" == Morten Welinder <mwelinder@gmail.com> writes:

Morten> It looks like handle_rev is seeing the same revisions over and over again.
Morten> I don't know why that would be, but the following patch just skips dups.
Morten> I have no idea if it is right, though.

Morten> Morten


Morten> diff --git a/git-annotate.perl b/git-annotate.perl
Morten> index 3800c46..a5e2d86 100755
Morten> --- a/git-annotate.perl
Morten> +++ b/git-annotate.perl
Morten> @@ -117,7 +117,10 @@ sub init_claim {

Morten>  sub handle_rev {
Morten>         my $i = 0;
Morten> +       my %seen = ();
Morten>         while (my $rev = shift @revqueue) {
Morten> +               next if $seen{$rev};
Morten> +               $seen{$rev} = 1;

Morten>                 my %revinfo = git_commit_info($rev);

The traditional idiom for that is

        next if $seen{$rev}++;

-- 
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<merlyn@stonehenge.com> <URL:http://www.stonehenge.com/merlyn/>
Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!


* Re: [PATCH] diff-delta: produce optimal pack data
From: Carl Baldwin @ 2006-02-24 19:23 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Junio C Hamano, git
In-Reply-To: <Pine.LNX.4.64.0602241350190.31162@localhost.localdomain>

On Fri, Feb 24, 2006 at 01:57:20PM -0500, Nicolas Pitre wrote:
> On Fri, 24 Feb 2006, Carl Baldwin wrote:
> 
> > My version is 1.2.1.  I have not been following your work.  When was
> > pack data reuse introduced?
> 
> Try out version 1.2.3.

I'm on it now.

> > From where can I obtain your delta patches?
> 
> Forget them for now -- they won't help you.

Ah, I have been looking at your patches and clearly they will not help.

> > There is really no opportunity for pack-data reuse in this case.  The
> > repository had never been packed or cloned in the first place.  As I
> > said, I do not intend to pack these binary files at all since there is
> > no benefit in this case.
> 
> Yes there is, as long as you have version 1.2.3.  The clone logic will 
> simply reuse already packed data without attempting to recompute it.

I meant that there is no benefit in disk space usage.  Packing may
actually increase my disk space usage in this case.  Refer to what I
said about experimentally running gzip and xdelta on the files
independently of git.

I see what you're saying about this data reuse helping to speed up
subsequent cloning operations.  However, if packing takes this long and
doesn't give me any disk space savings then I don't want to pay the very
heavy price of packing these files even the first time nor do I want to
pay the price incrementally.

The most I would tolerate for the first pack is a few seconds.  The most
I would tolerate for any incremental pack is about 1 second.

BTW, git repack has been going for 30 minutes and has packed 4/36
objects.  :-)

> I think you really should try git version 1.2.3 with a packed 
> repository.  It might handle your special case just fine.

No, not when I'm not willing to pay the price to pack even once.  This
isn't a case where I have one such repository and 'once it's been packed
then it's packed'.  This is only one example of such a repository.  I am
looking for a process for revisioning this type of data that will be
used over and over.  Git may not be the answer here but it sure is
looking good in many other ways.

I think the right answer would be for git to avoid trying to pack files
like this.  Junio mentioned something like this in his message.

Thanks for your input.

Cheers,
Carl



* Re: Removal of "--merge-order"?
From: Junio C Hamano @ 2006-02-24 19:25 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git
In-Reply-To: <Pine.LNX.4.64.0602240824110.3771@g5.osdl.org>

Linus Torvalds <torvalds@osdl.org> writes:

> In other words, I'd really prefer if it was gone. Some of the things I 
> might do to git-rev-list would be much simpler if I didn't have to worry 
> about merge-order, and the way it interfaces with the rest of 
> git-rev-list.
>
> Comments?
>
> 			Linus

I am really glad you brought it up.  I would not miss it at all.


* Re: git-annotate efficiency
From: Ryan Anderson @ 2006-02-24 19:42 UTC (permalink / raw)
  To: Morten Welinder; +Cc: GIT Mailing List
In-Reply-To: <118833cc0602241000p4e4c8017u3e3afe76fbbd75a4@mail.gmail.com>

On Fri, Feb 24, 2006 at 01:00:24PM -0500, Morten Welinder wrote:
> It looks like handle_rev is seeing the same revisions over and over again.
> I don't know why that would be, but the following patch just skips dups.
> I have no idea if it is right, though.

Merges.

a--b--c--d--f
   \-g--h--/

It would do f,d,c,b,a + f,h,g,b,a
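
A small sketch of why the merge causes repeated work, using the graph drawn above. It is written in Python, not taken from git-annotate.perl, and the queue-based traversal and the names PARENTS and visit are assumptions made for illustration. Without a seen-set, b and a are reached once through d and once through h; with it, each revision is processed once.

```python
from collections import deque

# Parent links for the graph in the message: f merges d and h,
# and both lines of history meet again at b.
PARENTS = {
    "f": ["d", "h"],
    "d": ["c"],
    "c": ["b"],
    "b": ["a"],
    "h": ["g"],
    "g": ["b"],
    "a": [],
}


def visit(start, dedupe):
    """Drain a revision queue from start and return the order of processing."""
    queue = deque([start])
    seen = set()
    order = []
    while queue:
        rev = queue.popleft()
        if dedupe:
            # Same effect as the patch: skip revisions already handled.
            if rev in seen:
                continue
            seen.add(rev)
        order.append(rev)
        queue.extend(PARENTS[rev])
    return order


without_seen = visit("f", dedupe=False)  # b and a are each processed twice
with_seen = visit("f", dedupe=True)      # every revision processed once
```

In a history with many merges the duplicated tails compound, which is consistent with the speed-up reported below.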

So yes, this fix is correct, and Junio, I'll be doing some changes this
weekend and send it along with a few other things.

(On a medium-sized test tree at work with 3500 commits in the tree, 37
on the main Makefile (6 merges), this cuts the annotate time from 10s to a little
over 2, so it's, umm, very worthwhile.)

> 
> Morten
> 
> 
> diff --git a/git-annotate.perl b/git-annotate.perl
> index 3800c46..a5e2d86 100755
> --- a/git-annotate.perl
> +++ b/git-annotate.perl
> @@ -117,7 +117,10 @@ sub init_claim {
> 
>  sub handle_rev {
>         my $i = 0;
> +       my %seen = ();
>         while (my $rev = shift @revqueue) {
> +               next if $seen{$rev};
> +               $seen{$rev} = 1;
> 
>                 my %revinfo = git_commit_info($rev);

-- 

Ryan Anderson
  sometimes Pug Majere


* Re: [PATCH] diff-delta: produce optimal pack data
From: Nicolas Pitre @ 2006-02-24 20:02 UTC (permalink / raw)
  To: Carl Baldwin; +Cc: Junio C Hamano, git
In-Reply-To: <20060224192354.GC387@hpsvcnb.fc.hp.com>

On Fri, 24 Feb 2006, Carl Baldwin wrote:

> I see what you're saying about this data reuse helping to speed up
> subsequent cloning operations.  However, if packing takes this long and
> doesn't give me any disk space savings then I don't want to pay the very
> heavy price of packing these files even the first time nor do I want to
> pay the price incrementally.

Of course.  There is admittedly a problem here.  I'm just abusing a bit 
of your time to properly identify its parameters.

> The most I would tolerate for the first pack is a few seconds.  The most
> I would tolerate for any incremental pack is about 1 second.

Well that is probably a bit tight.  Ideally it should be linear with the 
size of the data set to process.  If you have 10 files 10MB each it 
should take about the same time to pack as 10000 files of 10KB each.  
Of course incrementally packing one additional 10MB file might take more 
than a second although it is only one file.
 
> BTW, git repack has been going for 30 minutes and has packed 4/36
> objects.  :-)

Pathetic.

> I think the right answer would be for git to avoid trying to pack files
> like this.  Junio mentioned something like this in his message.

Yes.  First we could add an additional parameter to the repacking 
strategy which is the undeltified but deflated size of an object.  That 
would prevent any delta from becoming bigger than the simply deflated 
version.

That leaves the delta performance issue.  I think I know what the problem 
is.  I'm not sure I know what the best solution would be though.  The 
pathological data set is easy to identify quickly and one strategy might 
simply be to bail out early when it happens and therefore not attempt any 
delta.

However, if you could let me play with two samples of your big file I'd 
be grateful.  If so I'd like to make git work well with your data set 
too which is not that uncommon after all.


Nicolas


* git-mailinfo doesn't get installed any more
From: Tony Luck @ 2006-02-24 20:06 UTC (permalink / raw)
  To: Git Mailing List

Periodically after I upgrade git I do an "ls -lrt /usr/local/bin" to
find stray old binaries that aren't part of git anymore.  I was
a bit surprised to see that git-mailinfo had a mod-time a bit
older than the rest of git ... and looking at the Makefile it looks
like it got dropped in some rearrangement.

Two things:
1) Can someone put it back please, git-applymbox is very unhappy
without it.

2) What's the cute 1-line git way to see when this was broken. I'm
guessing that it involves using a --pickaxe.

Thanks

-Tony


* Re: [PATCH] diff-delta: produce optimal pack data
From: Linus Torvalds @ 2006-02-24 20:02 UTC (permalink / raw)
  To: Carl Baldwin; +Cc: Nicolas Pitre, Junio C Hamano, git
In-Reply-To: <20060224192354.GC387@hpsvcnb.fc.hp.com>



On Fri, 24 Feb 2006, Carl Baldwin wrote:
> 
> I meant that there is no benefit in disk space usage.  Packing may
> actually increase my disk space usage in this case.  Refer to what I
> said about experimentally running gzip and xdelta on the files
> independently of git.

Yes. The deltas tend to compress a lot less well than "normal" files.

> I see what you're saying about this data reuse helping to speed up
> subsequent cloning operations.  However, if packing takes this long and
> doesn't give me any disk space savings then I don't want to pay the very
> heavy price of packing these files even the first time nor do I want to
> pay the price incrementally.

I would look at tuning the heuristics in "try_delta()" (pack-objects.c) a 
bit. That's the place that decides whether to even bother trying to make a 
delta, and how big a delta is acceptable. For example, looking at them, I 
already see one bug:

	..
        sizediff = oldsize > size ? oldsize - size : size - oldsize;
        if (sizediff > size / 8)
                return -1;
	..

we really should compare sizediff to the _smaller_ of the two sizes, and 
skip the delta if the difference in sizes is bound to be bigger than that.

However, the "size / 8" thing isn't a very strict limit anyway, so this 
probably doesn't matter (and I think Nico already removed it as part of 
his patches: the heuristic can make us avoid some deltas that would be 
ok).
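
A hedged sketch of the size-difference check, in Python rather than the C of pack-objects.c. The function names are hypothetical; the first function mirrors the snippet quoted above, the second applies the suggested fix of comparing against the smaller of the two sizes.

```python
def skip_delta_original(oldsize: int, size: int) -> bool:
    """Mirror of the quoted check: compare the gap against size / 8."""
    sizediff = abs(oldsize - size)
    return sizediff > size // 8


def skip_delta_smaller(oldsize: int, size: int) -> bool:
    """Suggested variant: compare the gap against the smaller size / 8."""
    sizediff = abs(oldsize - size)
    return sizediff > min(oldsize, size) // 8


# An 880-byte object against a 1000-byte one: the gap is 120 bytes.
# The original check allows the delta (120 <= 125); the variant
# rejects it (120 > 110).
original_skips = skip_delta_original(880, 1000)
variant_skips = skip_delta_smaller(880, 1000)
```

The two versions only disagree in a narrow band of sizes, which fits the remark that the limit is not very strict.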

The other thing to look at is "max_size": right now it initializes that to 
"size / 2 - 20", which just says that we don't ever want a delta that is 
larger than about half the result (plus the 20 byte overhead for pointing 
to the thing we delta against). Again, if you feel that normal compression 
compresses better than half, you could try changing that to

	..
	max_size = size / 4 - 20;
	..

or something like that instead (but then you need to check that it's still 
positive - otherwise the comparisons with unsigned later on are screwed 
up. Right now that value is guaranteed to be positive if only because we 
already checked

	..
	if (size < 50)
		return -1;
	..

earlier).
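
A sketch of the max_size point, in Python with a hypothetical function name. With a divisor of 2 the existing "size < 50" check keeps the limit positive, but with a divisor of 4 it can go negative for small objects, so an explicit guard is needed. Python integers are signed, so the guard here stands in for the unsigned-comparison problem described for the C code.

```python
from typing import Optional


def delta_size_limit(size: int, divisor: int = 2) -> Optional[int]:
    """Return the largest acceptable delta size, or None to skip the delta."""
    if size < 50:
        return None  # same early exit as the quoted check
    max_size = size // divisor - 20  # 20 bytes of overhead for the base reference
    if max_size <= 0:
        return None  # guard needed once the divisor is larger than 2
    return max_size
```

For example, delta_size_limit(50) gives 5, while delta_size_limit(50, 4) would be negative without the guard and so returns None.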

NOTE! Every SINGLE one of those heuristics are just totally made up by 
yours truly, and have no testing behind them. They're more of the type 
"that sounds about right" than "this is how it must be". As mentioned, 
Nico has already been playing with the heuristics - but he wanted better 
packs, not better CPU usage, so he went the other way from what you would 
want to try..

		Linus


* Re: [PATCH] diff-delta: produce optimal pack data
From: Nicolas Pitre @ 2006-02-24 20:19 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Carl Baldwin, Junio C Hamano, git
In-Reply-To: <Pine.LNX.4.64.0602241152290.22647@g5.osdl.org>

On Fri, 24 Feb 2006, Linus Torvalds wrote:

> The other thing to look at is "max_size": right now it initializes that to 
> "size / 2 - 20", which just says that we don't ever want a delta that is 
> larger than about half the result (plus the 20 byte overhead for pointing 
> to the thing we delta against). Again, if you feel that normal compression 
> compresses better than half, you could try changing that to
> 
> 	..
> 	max_size = size / 4 - 20;
> 	..

Like I mentioned, max_size should also be capped with the deflated 
undeltified object size.  This value is easy to get since plain objects 
are already deflated.

> NOTE! Every SINGLE one of those heuristics are just totally made up by 
> yours truly, and have no testing behind them. They're more of the type 
> "that sounds about right" than "this is how it must be". As mentioned, 
> Nico has already been playing with the heuristics - but he wanted better 
> packs, not better CPU usage, so he went the other way from what you would 
> want to try..

Actually it's a good balance I'm after.

Using 30% more CPU for 10% smaller packs is OK I'd say.

Using 100 times the CPU for 50% saving on only one particular delta is 
not acceptable.

And using more than one hour for 200MB of data with the current window 
default is not acceptable either.


Nicolas


* Re: [PATCH] diff-delta: produce optimal pack data
From: Carl Baldwin @ 2006-02-24 20:40 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Junio C Hamano, git
In-Reply-To: <Pine.LNX.4.64.0602241438521.31162@localhost.localdomain>

On Fri, Feb 24, 2006 at 03:02:07PM -0500, Nicolas Pitre wrote:
> Well that is probably a bit tight.  Ideally it should be linear with the 
> size of the data set to process.  If you have 10 files 10MB each it 
> should take about the same time to pack as 10000 files of 10KB each.  
> Of course incrementally packing one additional 10MB file might take more 
> than a second although it is only one file.

Well, I might not have been fair here.  I tried an experiment where I
packed each of the twelve large blob objects explicitly one-by-one using
git-pack-objects.  Incrementally packing each single object was very
fast.  Well under a second per object on my machine.

After the twelve large objects were packed into individual packs the
rest of the packing went very quickly and git v1.2.3's data reuse worked
very well.  This was sort of my attempt at simulating how things would
be if git avoided deltification of each of these large files.  I'm sorry
to have been so harsh earlier; I just didn't understand that
incrementally packing one-by-one was going to help this much.

This gives me hope that if somehow git were to not attempt to deltify
these objects then performance would be much better than acceptable.

[snip]
> However, if you could let me play with two samples of your big file I'd 
> be grateful.  If so I'd like to make git work well with your data set 
> too which is not that uncommon after all.

I would be happy to do this.  I will probably need to scrub a bit and
make sure that the result shows the same characteristics.  How would you
like me to deliver these files to you?  They are about 25 MB deflated.

Carl



* Re: git-mailinfo doesn't get installed any more
From: Linus Torvalds @ 2006-02-24 20:42 UTC (permalink / raw)
  To: Tony Luck; +Cc: Git Mailing List
In-Reply-To: <12c511ca0602241206jaea9f75pce4ca687f5b2fd3c@mail.gmail.com>



On Fri, 24 Feb 2006, Tony Luck wrote:
> 
> 2) What's the cute 1-line git way to see when this was broken. I'm
> guessing that it involves using a --pickaxe.

You've actually found an interesting misfeature in git. There's a merge 
error, and you can't see it in the diffs by default because it wasn't due 
to a _clashing_ content thing, but two edits that were far enough away 
from each other.

That "git-mailinfo" thing is there in rev 2a3763ef, but it's not there in 
the current Makefile. And doing a

	git-whatchanged -p 2a3763ef.. | grep git-mailinfo

results in nothing. Which is not good.

Anyway, the way to handle that is to do "git bisect" (and use "grep 
git-mailinfo Makefile" in between bisection points to see if git-mailinfo 
is still part of the list of programs):

	git-bisect start
	# bad: [20d23f554d6cd40ffa0d41ccc9416bca867667e0] gitview: Bump the rev
	git-bisect bad 20d23f554d6cd40ffa0d41ccc9416bca867667e0
	# good: [2a3763ef3d26eb38c0a47997b8e5fd2a7c5214cc] avoid makefile override warning
	git-bisect good 2a3763ef3d26eb38c0a47997b8e5fd2a7c5214cc
	# bad: [ee072260dbff6914c24d956bcc2d46882831f1a0] Merge branch 'jc/nostat'
	git-bisect bad ee072260dbff6914c24d956bcc2d46882831f1a0
	# good: [551ce28fe1f2777eee7dd9c02bd44f55f4b32361] git-svn: 0.9.1: add --version and copyright/license (GPL v2+) information
	git-bisect good 551ce28fe1f2777eee7dd9c02bd44f55f4b32361
	# good: [5508a616631fb41531b638f744bd92c701727014] New test to verify that when git-clone fails it cleans up the new directory.
	git-bisect good 5508a616631fb41531b638f744bd92c701727014
	# bad: [712b1dd389ad5bcdbaab0279641f0970702fc1f1] Merge branch 'js/portable'
	git-bisect bad 712b1dd389ad5bcdbaab0279641f0970702fc1f1
	# good: [d800795613a710fb18353af53730e75185861f41] gitview: Use monospace font to draw the branch and tag name
	git-bisect good d800795613a710fb18353af53730e75185861f41
	# good: [b992933853ccffac85f7e40310167ef7b8f0432e] Fix "gmake -j"
	git-bisect good b992933853ccffac85f7e40310167ef7b8f0432e

resulting in:

	712b1dd389ad5bcdbaab0279641f0970702fc1f1 is first bad commit

which shows that there was a bad merge by Junio.
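
The search itself can be sketched as a plain binary search. This is a simplification in Python that assumes a linear history, whereas git-bisect works on the real commit graph. The commit names, Makefile contents and function names below are hypothetical; the test mirrors the "grep git-mailinfo Makefile" check.

```python
# Hypothetical linear history, oldest first: (commit name, Makefile text).
HISTORY = [
    ("c0", "PROGRAMS = git-mailinfo git-mailsplit"),
    ("c1", "PROGRAMS = git-mailinfo git-mailsplit"),
    ("c2", "PROGRAMS = git-mailinfo git-mailsplit"),
    ("c3", "PROGRAMS = git-mailinfo git-mailsplit"),
    ("c4", "PROGRAMS = git-mailinfo git-mailsplit"),
    ("c5", "PROGRAMS = git-mailsplit"),
    ("c6", "PROGRAMS = git-mailsplit"),
    ("c7", "PROGRAMS = git-mailsplit"),
]


def is_bad(entry):
    """A commit is bad when git-mailinfo is missing from its Makefile."""
    _name, makefile = entry
    return "git-mailinfo" not in makefile


def first_bad(history):
    """Assumes history[0] is good and history[-1] is bad; returns the first bad name."""
    good, bad = 0, len(history) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(history[mid]):
            bad = mid   # the breakage is at mid or earlier
        else:
            good = mid  # the breakage is after mid
    return history[bad][0]


culprit = first_bad(HISTORY)
```

Each test halves the remaining range, which is why a handful of good/bad answers was enough in the session above.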

You can use

	git show -c -p 712b1dd389ad5bcdbaab0279641f0970702fc1f1

to see why. It merged the thing perfectly fine, but sadly, incorrectly. 

Somebody should probably look at whether we could have done things better, 
but I suspect that merge errors are inevitable with any automated process.

Anyway, it might be worth remembering that 712b1dd3 merge for future 
testing. Make a test-case out of it.

		Linus


* Re: [PATCH] diff-delta: produce optimal pack data
From: Junio C Hamano @ 2006-02-24 20:53 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: git, Linus Torvalds, Carl Baldwin
In-Reply-To: <Pine.LNX.4.64.0602241152290.22647@g5.osdl.org>

Linus Torvalds <torvalds@osdl.org> writes:

> NOTE! Every SINGLE one of those heuristics are just totally made up by 
> yours truly, and have no testing behind them. They're more of the type 
> "that sounds about right" than "this is how it must be". As mentioned, 
> Nico has already been playing with the heuristics - but he wanted better 
> packs, not better CPU usage, so he went the other way from what you would 
> want to try..

I haven't looked at Nico's original or updated code closely at
all, but two things come to mind.

(1) if we could tell the particular data is intrinsically
    diff_delta unfriendly and diff_delta would waste too much
    time when tried to delta against almost _any_ other blob,
    then it might help to give an interface in diff-delta.c for
    the caller to check for such a blob without even trying
    diff_delta.

(2) otherwise, if diff_delta could detect it would spend too
    many cycles to finish its work for a particular input early
    on, we might want it to bail out instead of trying a
    complete job.


* Re: [PATCH] diff-delta: produce optimal pack data
From: Nicolas Pitre @ 2006-02-24 21:12 UTC (permalink / raw)
  To: Carl Baldwin; +Cc: Junio C Hamano, git
In-Reply-To: <20060224204022.GA15962@hpsvcnb.fc.hp.com>

On Fri, 24 Feb 2006, Carl Baldwin wrote:

> On Fri, Feb 24, 2006 at 03:02:07PM -0500, Nicolas Pitre wrote:
> > Well that is probably a bit tight.  Ideally it should be linear with the 
> > size of the data set to process.  If you have 10 files 10MB each it 
> > should take about the same time to pack as 10000 files of 10KB each.  
> > Of course incrementally packing one additional 10MB file might take more 
> > than a second although it is only one file.
> 
> Well, I might not have been fair here.  I tried an experiment where I
> packed each of the twelve large blob objects explicitly one-by-one using
> git-pack-objects.  Incrementally packing each single object was very
> fast.  Well under a second per object on my machine.
> 
> After the twelve large objects were packed into individual packs the
> rest of the packing went very quickly and git v1.2.3's data reuse worked
> very well.  This was sort of my attempt at simulating how things would
> be if git avoided deltification of each of these large files. I'm sorry
> to have been so harsh earlier; I just didn't understand that
> incrementally packing one-by-one was going to help this much.

Hmmmmmmm....

I don't think I understand what is going on here.

You say that, if you add those big files and incrementally repack after 
each commit using git repack with no option, then it requires only about 
one second each time.  Right?

But if you use "git-repack -a -f" then it is gone for more than an hour?

I'd expect something like 2 * (sum i for i = 1 to 10) i.e. in the 110 
second range due to the combinatorial effect when repacking everything.  
This is far from one hour and something appears to be really really 
wrong.
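
Spelled out, the estimate above is 2 * (1 + 2 + ... + 10) = 2 * 55 = 110.
The small helper below just reproduces that arithmetic.  Reading the
factor 2 as "seconds per delta attempt" and the 10 as the number of
terms in the sum is an assumption made for the sketch; the function and
parameter names are hypothetical.

```c
#include <assert.h>

/*
 * Hypothetical helper: secs_per_delta * (1 + 2 + ... + terms).
 * Both parameters are assumptions about what the figures stand for.
 */
unsigned long estimate_repack_seconds(unsigned terms,
				      unsigned long secs_per_delta)
{
	unsigned long total = 0;

	assert(terms > 0);	/* zero terms would mean no delta attempts */
	for (unsigned i = 1; i <= terms; i++)
		total += i;
	return secs_per_delta * total;
}
```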

How many files besides those 12 big blobs do you have?

> This gives me hope that if somehow git were to not attempt to deltify
> these objects then performance would be much better than acceptable.
> 
> [snip]
> > However, if you could let me play with two samples of your big file I'd 
> > be grateful.  If so I'd like to make git work well with your data set 
> > too which is not that uncommon after all.
> 
> I would be happy to do this.  I will probably need to scrub a bit and
> make sure that the result shows the same characteristics.  How would you
> like me to deliver these files to you?  They are about 25 MB deflated.

If you can add them into a single .tgz with instructions on how 
to reproduce the issue and provide me with a URL where I can fetch it 
that'd be perfect.


Nicolas


* Re: Removal of "--merge-order"?
From: Johannes Schindelin @ 2006-02-24 21:37 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Randy.Dunlap, Junio C Hamano, Git Mailing List
In-Reply-To: <Pine.LNX.4.64.0602240957430.22647@g5.osdl.org>

Hi,

On Fri, 24 Feb 2006, Linus Torvalds wrote:

> Now, rev-list.c is not the biggest file (apply.c is about twice the size), 
> but in many ways it's the most complex one by far. It's also the most 
> performance-critical one, and the one that it would be really nice if we 
> were to be able to libify it.

This is what I wanted to try today, but unfortunately I had to do real 
work :-(

> For example, instead of the horrid scripting language, I _think_ I could 
> almost libify it by just hooking into "show_commit", and using a callback 
> function for that (and then the stand-alone program would just make the 
> callback function be one that prints out the commit). 

I don't find the scripting language you invented particularly horrid. 
Maybe some odd things (like "if" branching to the "else" block whenever 
*any* argument was passed), but not horrid.

But in the end I would prefer a libified git, if only to get rid of 
double parsing (if you pipe the output of git-rev-list to another git 
program, chances are that you parse the commit objects at least twice).
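
To make the callback idea concrete, here is a rough sketch of what such
an interface could look like.  Everything in it is hypothetical: the toy
single-parent "struct commit", the names walk_commits, print_commit and
count_commit, and the callback signature are invented for illustration
and do not match rev-list.c.

```c
#include <assert.h>
#include <stdio.h>

/* Toy commit with a single parent; the real structure is richer. */
struct commit {
	const char *id;
	const struct commit *parent;
};

/* Hypothetical callback type standing in for "show_commit". */
typedef void (*show_commit_fn)(const struct commit *c, void *data);

/* Walk from the tip to the root, handing each commit to the callback. */
int walk_commits(const struct commit *tip, show_commit_fn show, void *data)
{
	int n = 0;

	assert(show != NULL);
	for (const struct commit *c = tip; c; c = c->parent) {
		show(c, data);
		n++;
	}
	return n;
}

/* What the stand-alone program would plug in: print the commit id. */
void print_commit(const struct commit *c, void *data)
{
	fprintf((FILE *)data, "%s\n", c->id);
}

/* What an in-process caller might plug in: no text output, no re-parsing. */
void count_commit(const struct commit *c, void *data)
{
	(void)c;
	++*(int *)data;
}
```

The stand-alone tool would call walk_commits(tip, print_commit, stdout),
while another program linked against the same code gets the commit
objects directly and does not have to parse them a second time.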

> That would possibly be a simpler way to get away from using nonportable 
> scripts. Plain C really does remain one of the most portable things out 
> there.

Yes.

Ciao,
Dscho


* Re: [PATCH] diff-delta: produce optimal pack data
From: Nicolas Pitre @ 2006-02-24 21:39 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Linus Torvalds, Carl Baldwin
In-Reply-To: <7vpslc8oni.fsf@assigned-by-dhcp.cox.net>

On Fri, 24 Feb 2006, Junio C Hamano wrote:

> I haven't looked at Nico's original or updated code closely at
> all, but two things come to mind.
> 
> (1) if we could tell the particular data is intrinsically
>     diff_delta unfriendly and diff_delta would waste too much
>     time when tried to delta against almost _any_ other blob,
>     then it might help to give an interface in diff-delta.c for
>     the caller to check for such a blob without even trying
>     diff_delta.
> 
> (2) otherwise, if diff_delta could detect it would spend too
>     many cycles to finish its work for a particular input early
>     on, we might want it to bail out instead of trying a
>     complete job.

I have a patch that implements a hybrid approach.

Currently, diff-delta takes blocks of data in the reference file and 
hashes them.  When the target file is scanned, it uses the hash to match 
blocks from the target file with the reference file.

If blocks are hashed evenly the cost of producing a delta is at most 
O(n+m) where n and m are the size of the reference and target files 
respectively.  In other words, with a good data set the cost is linear.

But if many blocks from the reference buffer do hash to the same bucket 
then for each block in the target file many blocks from the reference 
buffer have to be tested against, making it tend towards O(n^m) which is 
pretty highly exponential.

The solution I'm investigating is to put a limit on the number of 
entries in the same hash bucket so as to bring the cost back to something 
more linear.  That means the delta might miss better matches that 
have not been hashed but still benefit from a limited set. Experience 
seems to show that the time to deltify the first two blobs you found to 
be problematic can be reduced by 2 orders of magnitude with only about a 
10% increase in the resulting delta size, and still nearly 40% smaller 
than what the current delta code produces.

The question is how to determine the best limit on the number of entries 
in the same hash bucket.
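
For illustration only, a capped index could be sketched as below.  This
is not the patch itself: the block size, the hash function, the table
sizes and all names are assumptions picked to keep the example short.
The point is just that with a per-bucket limit, the number of reference
blocks to test for any target block is bounded by that limit no matter
how badly the data hashes.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK		16		/* assumed block size */
#define HASH_SIZE	256u		/* assumed number of buckets */
#define MAX_BLOCKS	4096		/* capacity of this toy index */

struct block_index {
	int head[HASH_SIZE];		/* first entry per bucket, -1 if empty */
	int count[HASH_SIZE];		/* entries currently in each bucket */
	int next[MAX_BLOCKS];		/* chain link to the next entry */
	size_t offset[MAX_BLOCKS];	/* block offset in the reference */
	int used;
};

/* Toy hash, chosen for brevity rather than quality. */
static unsigned hash_block(const unsigned char *p)
{
	unsigned h = 0;

	for (int i = 0; i < BLOCK; i++)
		h = h * 31 + p[i];
	return h % HASH_SIZE;
}

/*
 * Index the reference buffer, keeping at most `limit` blocks per bucket.
 * Returns how many blocks were dropped because their bucket was full.
 * Blocks beyond MAX_BLOCKS are silently left unindexed.
 */
size_t build_index(struct block_index *ix, const unsigned char *ref,
		   size_t len, int limit)
{
	size_t dropped = 0;

	assert(limit > 0);
	ix->used = 0;
	for (unsigned b = 0; b < HASH_SIZE; b++) {
		ix->head[b] = -1;
		ix->count[b] = 0;
	}
	for (size_t off = 0; off + BLOCK <= len && ix->used < MAX_BLOCKS;
	     off += BLOCK) {
		unsigned b = hash_block(ref + off);

		if (ix->count[b] >= limit) {
			dropped++;	/* bucket full: possible match lost */
			continue;
		}
		ix->offset[ix->used] = off;
		ix->next[ix->used] = ix->head[b];
		ix->head[b] = ix->used++;
		ix->count[b]++;
	}
	return dropped;
}

/* Worst-case number of candidates to test for one target block. */
int longest_chain(const struct block_index *ix)
{
	int longest = 0;

	for (unsigned b = 0; b < HASH_SIZE; b++)
		if (ix->count[b] > longest)
			longest = ix->count[b];
	return longest;
}
```

With a buffer of identical blocks every block lands in one bucket, so
the limit directly trades lost candidates for a shorter chain.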


Nicolas


* Re: [PATCH] New git-seek command with documentation and test.
From: Johannes Schindelin @ 2006-02-24 21:48 UTC (permalink / raw)
  To: Carl Worth; +Cc: Andreas Ericsson, Junio C Hamano, git, Linus Torvalds
In-Reply-To: <87oe0wrg29.wl%cworth@cworth.org>

Hi,

On Fri, 24 Feb 2006, Carl Worth wrote:

> On Fri, 24 Feb 2006 11:00:29 +0100, Andreas Ericsson wrote:
> > 
> > I've said it before, and I'll say it again. This tool provides less 
> > flexibility and much less power than "git checkout -b branch 
> > <commit-ish>"
> 
> Yes, that's by design. It's not intended to be a replacement for git
> checkout -b.

I do not really understand why.

git-seek shares so many characteristics with git-checkout, you could make 
git-seek just another command line option to checkout (like "--temporary" 
and "--go-back").

Hth,
Dscho


* Re: [PATCH] diff-delta: produce optimal pack data
From: Nicolas Pitre @ 2006-02-24 21:48 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Linus Torvalds, Carl Baldwin
In-Reply-To: <Pine.LNX.4.64.0602241613030.31162@localhost.localdomain>

On Fri, 24 Feb 2006, Nicolas Pitre wrote:

> If blocks are hashed evenly the cost of producing a delta is at most 
> O(n+m) where n and m are the size of the reference and target files 
> respectively.  In other words, with a good data set the cost is linear.
> 
> But if many blocks from the reference buffer do hash to the same bucket 
> then for each block in the target file many blocks from the reference 
> buffer have to be tested against, making it tend towards O(n^m) which is 
> pretty highly exponential.

Well, actually this is rather O(n*m) not O(n^m), but bad nevertheless.


Nicolas


* Re: [PATCH] New git-seek command with documentation and test.
From: J. Bruce Fields @ 2006-02-24 21:57 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Carl Worth, Andreas Ericsson, Junio C Hamano, git, Linus Torvalds
In-Reply-To: <Pine.LNX.4.63.0602242246430.11479@wbgn013.biozentrum.uni-wuerzburg.de>

On Fri, Feb 24, 2006 at 10:48:46PM +0100, Johannes Schindelin wrote:
> git-seek shares so many characteristics with git-checkout, you could make 
> git-seek just another command line option to checkout (like "--temporary" 
> and "--go-back").

Well, as a user interface, git-seek seems a bit simpler (e.g., easier to
remember).

--b.


* [PATCH] git ls files recursively show ignored files
From: Shawn Pearce @ 2006-02-24 22:02 UTC (permalink / raw)
  To: git

Make git-ls-files --others --ignored recurse into non-excluded
subdirectories.

Typically when asking git-ls-files to display all files which are
ignored by one or more exclude patterns one would want it to recurse
into subdirectories which are not themselves excluded to see if
there are any excluded files contained within those subdirectories.

---
 I found this issue while trying to find temporary garbage that was
 created within a tracked subdirectory:

   touch a/b/foo#1
   git-ls-files --others --ignored --exclude='*#1'

 would never display a/b/foo#1 as the directory 'a' was not itself
 excluded.  It would be rather nice if git-ls-files actually
 showed it.

 ls-files.c |    7 +++++--
 1 files changed, 5 insertions(+), 2 deletions(-)

base 98214e96be00c5132047ae80bca20d4690933c33
last 809da5f8a21a10112eece4ee9be55fe64371ce68
diff --git a/ls-files.c b/ls-files.c
index 90b289f03d987c6c90214fc12d00c30b4e28bb27..df25c8c012a96a8277413ca3a81490b81b7dc067 100644
--- a/ls-files.c
+++ b/ls-files.c
@@ -279,8 +279,11 @@ static void read_directory(const char *p
 				continue;
 			len = strlen(de->d_name);
 			memcpy(fullname + baselen, de->d_name, len+1);
-			if (excluded(fullname) != show_ignored)
-				continue;
+			if (excluded(fullname) != show_ignored) {
+				if (!show_ignored || DTYPE(de) != DT_DIR) {
+					continue;
+				}
+			}
 
 			switch (DTYPE(de)) {
 			struct stat st;
-- 
1.2.3.g809d
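
The effect of the changed test can be isolated in two small predicates.
This is only a restatement of the condition in the hunk above with
hypothetical function names, where `is_dir` stands for
DTYPE(de) == DT_DIR and `is_excluded` for a non-zero excluded(fullname).

```c
#include <assert.h>
#include <stdbool.h>

/* Old condition: skip whenever exclusion status and mode disagree. */
bool skip_before(bool is_excluded, bool show_ignored, bool is_dir)
{
	(void)is_dir;
	return is_excluded != show_ignored;
}

/* New condition: same, except a non-excluded directory is still entered
 * when ignored files were asked for. */
bool skip_after(bool is_excluded, bool show_ignored, bool is_dir)
{
	if (is_excluded != show_ignored) {
		if (!show_ignored || !is_dir)
			return true;
	}
	return false;
}

/* The only input on which the two differ. */
void check_patch_effect(void)
{
	assert(skip_before(false, true, true));		/* 'a' was skipped */
	assert(!skip_after(false, true, true));		/* 'a' is now entered */
}
```

For every other combination of the three inputs the two predicates
agree, so the listing without --ignored is unchanged.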


* Re: [PATCH] diff-delta: produce optimal pack data
From: Carl Baldwin @ 2006-02-24 22:50 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Junio C Hamano, git
In-Reply-To: <Pine.LNX.4.64.0602241544270.31162@localhost.localdomain>

On Fri, Feb 24, 2006 at 04:12:14PM -0500, Nicolas Pitre wrote:
> On Fri, 24 Feb 2006, Carl Baldwin wrote:
> > After the twelve large objects were packed into individual packs the
> > rest of the packing went very quickly and git v1.2.3's data reuse worked
> > very well.  This was sort of my attempt at simulating how things would
> > be if git avoided deltification of each of these large files. I'm sorry
> > to have been so harsh earlier; I just didn't understand that
> > incrementally packing one-by-one was going to help this much.
> 
> Hmmmmmmm....
> 
> I don't think I understand what is going on here.
> 
> You say that, if you add those big files and incrementally repack after 
> each commit using git repack with no option, then it requires only about 
> one second each time.  Right?

Well, actually I was packing them individually by calling
git-pack-objects directly on each blob.

I'll try doing it exactly as you describe...

Ok, I tried it.  Basically I do the following.

% mkdir test
% cd test
% git init-db
% cp ../files/binfile.1 binfile
% time git add binfile

real    0m2.459s
user    0m2.443s
sys     0m0.019s
% git commit -a -m "Rev 1"
% time git repack
[snip]

real    0m1.111s
user    0m1.046s
sys     0m0.061s
% for i in $(seq 2 12); do
    cp ../files/binfile.$i binfile
    time git commit -a -m "Rev $i"
    time git repack
done

Each commit takes around 2.8-3.5 seconds and each repack takes about
1.2-1.5 seconds.  These are pretty reasonable.

Now, I try 'git repack -a -f' (or even without -f) and it goes out to
lunch.  I think it would take on the order of a day to actually finish
because it wasn't very far after an hour.

[snip]
> How many files besides those 12 big blobs do you have?

This repository has been completely stripped to the 12 revisions of the
one file.  So, there are 36 objects.

12 blobs.
12 trees.
12 commits.

That is all.

[snip]
> If you can add them into a single .tgz with instructions on how 
> to reproduce the issue and provide me with a URL where I can fetch it 
> that'd be perfect.

I will do this in an email off of the list because these files really
shouldn't be available on a public list.

Carl

-- 
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 Carl Baldwin                        RADCAD (R&D CAD)
 Hewlett Packard Company
 MS 88                               work: 970 898-1523
 3404 E. Harmony Rd.                 work: Carl.N.Baldwin@hp.com
 Fort Collins, CO 80525              home: Carl@ecBaldwin.net
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -


