Git development
* Re: [PATCH] update-cache.c ignore directories
From: Fabian Franz @ 2005-04-22 22:27 UTC
  To: atani; +Cc: GIT Mailing List
In-Reply-To: <1114208707.12699@tsunami.he.net>

On Saturday, 23 April 2005 00:25, atani wrote:

> Now it spits out:
> -------------
> 'plx' is a directory, ignoring
> -------------

I saw that you spit this out to stdout. Wouldn't it be better to spit it out
to stderr (even if it's just a warning)?

cu

Fabian


* [PATCH] update-cache.c ignore directories
From: atani @ 2005-04-22 22:25 UTC
  To: GIT Mailing List

--- sorry if this dupes, mail client issues...

In my tests of using git (both Linus and pasky versions) I had a
problem with doing "gitadd.sh *" where * expands to include
directories.  This simple patch allows update-cache.c to more
gracefully handle a directory being passed into the add_file_to_cache
method.  Without this patch update-cache exits prematurely with an
error similar to:
-------------
fatal: Unable to add plx to database
-------------

Now it spits out:
-------------
'plx' is a directory, ignoring
-------------

Which, from an end user's standpoint, is better.

BTW, so far my tests of using git are positive for my small Dreamcast
software projects...  I was previously using subversion but find it to
be a bit of overkill for these small projects.

Martin Schlemmer,  I ran "emerge sync" today and found git has been
added to portage, version 0.5.  Also note that there are now two "git"
entries within portage: app-misc/git and dev-util/git.  app-misc/git is
GNU Interactive Tools.

Mike
 
Signed-off-by: Mike Dunston (atani@atani-software.net) 
 
Index: update-cache.c 
=================================================================== 
--- 690494557d393ca78f69a8569880ed4a3aeda276/update-cache.c  (mode:100644 sha1:4353b80890ba2afbe22248a4dc25060aa4a429b2)
+++ uncommitted/update-cache.c  (mode:100644) 
@@ -104,6 +104,11 @@ 
                close(fd); 
                return -1; 
        } 
+       if(S_ISDIR(st.st_mode)) { 
+               printf("'%s' is a directory, ignoring\n", path); 
+               close(fd); 
+               return 0; 
+       } 
        namelen = strlen(path); 
        size = cache_entry_size(namelen); 
        ce = malloc(size); 

* Re: [PATCH] More docs
From: Junio C Hamano @ 2005-04-22 22:23 UTC
  To: David Greaves; +Cc: git
In-Reply-To: <4269704A.9090503@dgreaves.com>

>>>>> "DG" == David Greaves <david@dgreaves.com> writes:

DG> Merging
DG> If -m is specified, read-tree performs 2 kinds of merge, a subservient
DG> tree-read if only 1 tree is given or a 3-way merge if 3 trees are
DG> provided.

AFAICR Linus never used the word "subservient" to describe this
case [*R1*].  I do not know if the word is a good fit for
describing what it does.  Sorry, I cannot help you in deciding
if this is the right word nor in picking a better word.  I am
not a native speaker so I had to look it up in the dictionary.

DG> Furthermore, "read-tree" has special-case logic that says: if you see
DG> a file that matches in all respects in the following states, it
DG> "collapses" back to "stage0":
DG>      - stage 2 and 3 are the same
DG>      - stage 1 and stage 2 are the same and stage 3 is different
DG>      - stage 1 and stage 3 are the same and stage 2 is different

That is what I wrote so I should say "sounds good", but after
re-reading it I realize we should describe how these trivial
ones are resolved, like so:

    Furthermore, "read-tree" has special-case logic that says: if you see
    a file that matches in all respects in the following states, it
    "collapses" back to "stage0":

     - stage 2 and 3 are the same;
       take one or the other (it does not make a difference)
     - stage 1 and stage 2 are the same and stage 3 is different;
       take stage 3
     - stage 1 and stage 3 are the same and stage 2 is different;
       take stage 2
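
    As a rough sketch only (not the read-tree code), the three rules can
    be written as one function.  The name trivial_merge_stage() is
    hypothetical, and "matches in all respects" is simplified here to
    comparing the 20-byte SHA1s; the file mode is ignored:

```c
#include <string.h>

#define EX_SHA1_LEN 20

/*
 * Hypothetical helper: given the SHA1s found in stages 1, 2 and 3 for one
 * path, return the stage (2 or 3) whose entry should collapse to stage 0,
 * or 0 when none of the trivial rules applies.
 */
static int trivial_merge_stage(const unsigned char *s1,
			       const unsigned char *s2,
			       const unsigned char *s3)
{
	if (!memcmp(s2, s3, EX_SHA1_LEN))
		return 2;	/* both sides agree; either one will do */
	if (!memcmp(s1, s2, EX_SHA1_LEN))
		return 3;	/* only stage 3 changed; take stage 3 */
	if (!memcmp(s1, s3, EX_SHA1_LEN))
		return 2;	/* only stage 2 changed; take stage 2 */
	return 0;		/* all three differ; not a trivial case */
}
```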
    
DG> show-files
DG> show-files [-z] [-t] (--[cached|deleted|others|ignored|stage])*
>> Although I like it, I do not think -t is in core.  It is Pasky.
DG> Well, it says Copyright (C) Linus Torvalds, 2005 - and Linus describes
DG> it in his discussion so...

My comment was only about the '-t' option.  It is not one of the
options in the core.  Pasky may want to feed the change to
Linus.


[References]
*R1*

    Date:	Tue, 19 Apr 2005 11:27:34 -0700 (PDT)
    From:	Linus Torvalds <torvalds@osdl.org>
    Subject: Re: naive question
    Message-ID: <Pine.LNX.4.58.0504191117570.19286@ppc970.osdl.org>

    On Tue, 19 Apr 2005, Linus Torvalds wrote:
    > 
    > The real expense right now of a merge is that we always forget all the
    > stat information when we do a merge (since it does a read-tree). I have a
    > cunning way to fix that, though, which is to make "read-tree -m" read in
    > the old index state like it used to, and then at the end just throw it
    > away except for the stat information.

    Ok, done. That was really the plan all along, it just got dropped in the 
    excitement of trying to get the dang thing to _work_ in the first place ;)

    ... I'll also make it do the same for a "single-tree merge":

            read-tree -m <newtree>

    so that you can basically say "read a new tree, and merge the stat 
    information from the current cache".  That means that if you do a
    "read-tree -m <newtree>" followed by a "checkout-cache -f -a", the 
    checkout-cache only checks out the stuff that really changed.


* Re: [OT] git logo or mascot (was: Re: wit 0.0.3 - a web interface for git available)
From: Timothy R. Chavez @ 2005-04-23  7:19 UTC
  To: Fabian Franz; +Cc: git
In-Reply-To: <200504222322.33934.FabianFranz@gmx.de>

On Friday 22 April 2005 21:22, you wrote:
> On Friday, 22 April 2005 23:09, Greg KH wrote:
> > On Fri, Apr 22, 2005 at 10:35:22PM +0200, Christian Meder wrote:
> > > On Thu, 2005-04-21 at 00:33 -0700, Greg KH wrote:
> > > > Very nice, this looks great.  And hey, we have a git logo now :)
> > >
> > > BTW is this logo already officially blessed ?
> >
> > "blessed" how?
>
> Well if it should be official for git, Linus has to "bless" it imho.
>
> > Have an alternative one?
>
> Well, yes. I would like something, which has something to do with the
> original linux mascot.
>
> Like a fish (though that is already used by some other projects) - penguins
> like fish! (Tux too? Or more Linus' fingers ;)? )

I think the logo should be a tortoise.  Why?  Because it's somewhat ironic.
One of Linus' complaints about SCMs in general, and one of his praises of
BitKeeper, was performance/speed.  The tortoise is typified as being a slow
and cumbersome animal.  Also, the hare, the tortoise's competition, is
typified as being quick and possessing fecundity (things that you'd want to
identify with "git"... and the job of any decent C programmer is to obfuscate
meaning when possible J/K).

Oh, and there's a picture of a tortoise named "git.gif"... 

http://www.angelfire.com/oh5/juniorglory/images/git.gif

Perhaps not.  Hah

-tim

>
> Or perhaps even a wife for Tux, which helps him organizing and managing?
>
> ... Tux & Git having nice holidays on an island? ...
>
> Hm, but what is cogito then ...
>
> cu
>
> Fabian

* Re: "GIT_INDEX_FILE" environment variable
From: Linus Torvalds @ 2005-04-22 22:14 UTC
  To: Junio C Hamano; +Cc: Git Mailing List
In-Reply-To: <7vbr867ecy.fsf@assigned-by-dhcp.cox.net>



On Fri, 22 Apr 2005, Junio C Hamano wrote:
> 
> Almost, with a counter-example.  Please try this yourself:

I agree that what git outputs is always "based on the archive base". But 
that's an independent issue from "where is the working directory". That's 
the issue of "how do you want me to print out the results".

To see just how independent that is, think about how git-pasky (and,
indeed, standard "show-diff") already prints out the results in a
_different_ base than the working directory _or_ the base. Ie the way we 
already do

	--- a/Makefile
	+++ b/Makefile
	... patch ...

for a patch to "Makefile" in the top-level directory.

IOW, showing pathnames is different from _using_ them. And if you were 
planning on using the same logic for both, you'd have been making a 
mistake in the first place.

To _use_ pathnames, you use "pwd". To _show_ them, you use some other
mechanism. You must not mix up those two issues, or you'd always get
"show-diff" wrong.

I actually think that showing the pathnames is up to the wrapper scripts. 
Git core really always just works on the "canonical" format.

(And I personally think that "show-diff" is really part of the "wrapper
scripts" around git. I wrote it originally just because I needed something
to verify the index file handling, not because it's "core" like the other
programs. I do _not_ consider "show-diff" to be part of the core git code,
really. Same goes for "git-export", btw - for the same reasons. It's not
"fundamental").

		Linus

* compiling git with ZLIB < 1.2
From: Andreas Gal @ 2005-04-22 22:12 UTC
  To: git
In-Reply-To: <S262157AbVDVVWs/20050422212248Z+375@vger.kernel.org>


deflateBound() was added in ZLIB 1.2, but there is unfortunately no easy
way to check against the ZLIB version. I would suggest using the fix
below until everyone has a recent ZLIB installed (neither my RHEL3 nor my
Darwin box does by default).
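
For illustration, the estimate used in the #else branch of the patch below
can be pulled out into a small function.  The name
deflate_bound_fallback() is hypothetical; only the arithmetic is taken
from the patch.  As far as I know, zlib headers from the 1.2 series also
define a ZLIB_VERNUM macro, which could drive the #ifdef instead of a
hand-set symbol, but treat that as an assumption to verify against your
zlib.h:

```c
/*
 * Hypothetical helper: the same size estimate the patch computes inline
 * when deflateBound() is not available.  'len' is the uncompressed length.
 */
static unsigned long deflate_bound_fallback(unsigned long len)
{
	return len + ((len + 7) >> 3) + ((len + 63) >> 6) + 11;
}
```

For a 64-byte input this gives 64 + 8 + 1 + 11 = 84 bytes of output
buffer.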

Andreas

--- a/sha1_file.c
+++ b/sha1_file.c
@@ -240,7 +240,11 @@ int write_sha1_file(char *buf, unsigned 
 	/* Set it up */
 	memset(&stream, 0, sizeof(stream));
 	deflateInit(&stream, Z_BEST_COMPRESSION);
+#ifdef ONCE_EVERYONE_UPDATED_TO_ZLIB_12
 	size = deflateBound(&stream, len);
+#else	
+	size = len + ((len + 7) >> 3) + ((len + 63) >> 6) + 11;
+#endif	
 	compressed = malloc(size);
 
 	/* Compress it */

* Re: [PATCH] More docs
From: David Greaves @ 2005-04-22 21:44 UTC
  To: Junio C Hamano; +Cc: git
In-Reply-To: <7vwtqu5ymu.fsf@assigned-by-dhcp.cox.net>

Junio C Hamano wrote:

Thanks for the comments Junio

>>>>>>"DG" == David Greaves <david@dgreaves.com> writes:
> 
> This is Cogito invention, not in the core.  Neither is tree-id.
OK - hard to tell sometimes...

> DG> +	/0 line termination on output
> 
> Write this either '\0' (for C literate) or NUL (ASCII character
> name), please.  The same for other commands with -z.
hmm, this...

> DG> +--cached
> DG> +	Cached only (private?)
> 
> What?  The beauty of diff-tree is it does not care about
> dircache at all.  Maybe this is a Pasky addition, but I wonder
> what the semantics of this option and why it is here...
...and this, are pre-final-polish
I must have committed the wrong one - <sigh>

> DG> +NOTE NOTE NOTE! although read-tree could do some of these nontrivial
> DG> +merges, only the "matches in all three states" thing collapses by
> DG> +default.
> 
> The above "NOTE" is taken from the initial message from Linus
> but it is no longer true.  These days, it merges when:
OK - will edit

how does this sound:

Merging
If -m is specified, read-tree performs 2 kinds of merge, a subservient
tree-read if only 1 tree is given or a 3-way merge if 3 trees are
provided.

Subservient Tree Read
If only 1 tree is specified, read-tree operates as if the user did not
specify "-m", except that if the original cache has an entry for a
given pathname, and the contents of the path match the tree being
read, the stat info from the cache is used. (In other words, the
cache's stat()s take precedence over the subservient tree's.)

This is used to avoid unnecessary false hits when show-diff is
run after read-tree.

3-Way Merge
Each "index" entry has two bits worth of "stage" state. stage 0 is the
normal one, and is the only one you'd see in any kind of normal use.

However, when you do "read-tree" with multiple trees, the "stage"
starts out at 0, but increments for each tree you read. And in
particular, the "-m" flag means "start at stage 1" instead.

This means that you can do

	read-tree -m <tree1> <tree2> <tree3>

and you will end up with an index with all of the <tree1> entries in
"stage1", all of the <tree2> entries in "stage2" and all of the
<tree3> entries in "stage3".
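
To make the "two bits worth of stage state" concrete, here is a small
sketch.  The bit layout (name length in the low 12 bits of a 16-bit flags
word, stage in the next two) and every EX_/ex_ name are assumptions made
for illustration, not a statement of what cache.h defines:

```c
#include <stdint.h>

/* Assumed layout: bits 0-11 name length, bits 12-13 stage (0..3). */
#define EX_NAMEMASK   0x0fff
#define EX_STAGESHIFT 12
#define EX_STAGEMASK  0x3000

/* Pack a name length and a stage number into one flags word. */
static uint16_t ex_make_flags(unsigned namelen, unsigned stage)
{
	return (uint16_t)((namelen & EX_NAMEMASK) |
			  ((stage << EX_STAGESHIFT) & EX_STAGEMASK));
}

/* Extract the stage (0 = normal, 1..3 = unmerged entries). */
static unsigned ex_stage(uint16_t flags)
{
	return (flags & EX_STAGEMASK) >> EX_STAGESHIFT;
}

/* Extract the name length. */
static unsigned ex_namelen(uint16_t flags)
{
	return flags & EX_NAMEMASK;
}
```

With two bits there are exactly four values, which is why stage 0 plus
the three merge stages fit and nothing more.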

Furthermore, "read-tree" has special-case logic that says: if you see
a file that matches in all respects in the following states, it
"collapses" back to "stage0":
     - stage 2 and 3 are the same
     - stage 1 and stage 2 are the same and stage 3 is different
     - stage 1 and stage 3 are the same and stage 2 is different

NOTE NOTE NOTE para removed.

> DG>  ################################################################
> DG> @@ -151,8 +603,145 @@
> DG>  show-files
> DG>  	show-files [-z] [-t] (--[cached|deleted|others|ignored|stage])*
>  
> Although I like it, I do not think -t is in core.  It is Pasky.
Well, it says Copyright (C) Linus Torvalds, 2005 - and Linus describes 
it in his discussion so...

> Also you missed "show-files --unmerged".
I did - that's me using the usage() string!!

David

-- 

* [OT] git logo or mascot (was: Re: wit 0.0.3 - a web interface for git available)
From: Fabian Franz @ 2005-04-22 21:22 UTC
  To: Greg KH, Christian Meder; +Cc: Kay Sievers, git
In-Reply-To: <20050422210905.GB1829@kroah.com>

On Friday, 22 April 2005 23:09, Greg KH wrote:
> On Fri, Apr 22, 2005 at 10:35:22PM +0200, Christian Meder wrote:
> > On Thu, 2005-04-21 at 00:33 -0700, Greg KH wrote:
> > > Very nice, this looks great.  And hey, we have a git logo now :)
> >
> > BTW is this logo already officially blessed ?
>
> "blessed" how?

Well if it should be official for git, Linus has to "bless" it imho.

> Have an alternative one?

Well, yes. I would like something, which has something to do with the original 
linux mascot.

Like a fish (though that is already used by some other projects) - penguins 
like fish! (Tux too? Or more Linus' fingers ;)? )

Or perhaps even a wife for Tux, which helps him organizing and managing?

... Tux & Git having nice holidays on an island? ...

Hm, but what is cogito then ...

cu

Fabian


* Re: wit 0.0.3 - a web interface for git available
From: Christian Meder @ 2005-04-22 21:16 UTC
  To: Greg KH; +Cc: Kay Sievers, git
In-Reply-To: <20050422210905.GB1829@kroah.com>

On Fri, 2005-04-22 at 14:09 -0700, Greg KH wrote:
> On Fri, Apr 22, 2005 at 10:35:22PM +0200, Christian Meder wrote:
> > On Thu, 2005-04-21 at 00:33 -0700, Greg KH wrote:
> > > On Thu, Apr 21, 2005 at 03:28:27AM +0200, Kay Sievers wrote:
> > > > On Wed, Apr 20, 2005 at 10:42:53AM +0100, Christoph Hellwig wrote:
> > > > > On Tue, Apr 19, 2005 at 09:18:29PM -0700, Greg KH wrote:
> > > > > > On Wed, Apr 20, 2005 at 02:29:11AM +0200, Christian Meder wrote:
> > > > > > > Hi,
> > > > > > > 
> > > > > > > ok it's starting to look like spam ;-)
> > > > > > > 
> > > > > > > I uploaded a new version of wit to http://www.absolutegiganten.org/wit
> > > > > > 
> > > > > > Why not work together with Kay's tool:
> > > > > > 	http://ehlo.org/~kay/gitweb.pl?project=linux-2.6&action=show_log
> > > > > 
> > > > > That one looks really nice.  One major feature I'd love to see would
> > > > > be a show all diffs link for a changeset.
> > > > 
> > > > It's working now:
> > > >   http://ehlo.org/~kay/gitweb.pl
> > > > 
> > > > Many thanks to Christian Gierke for all the interface work, the nice
> > > > layout and the git logo. Thanks for the colored diff to Ken Brush.
> > > 
> > > Very nice, this looks great.  And hey, we have a git logo now :)
> > 
> > BTW is this logo already officially blessed ?
> 
> "blessed" how?

Linus likes it ;-)

> Have an alternative one?

Nope. I actually like the one on gitweb.


			Christian

-- 
Christian Meder, email: chris@absolutegiganten.org

The Way-Seeking Mind of a tenzo is actualized 
by rolling up your sleeves.

                (Eihei Dogen Zenji)


* Re: First web interface and service API draft
From: Christian Meder @ 2005-04-22 20:57 UTC
  To: Jan Harkes; +Cc: git
In-Reply-To: <20050422142342.GG30915@delft.aura.cs.cmu.edu>

On Fri, 2005-04-22 at 10:23 -0400, Jan Harkes wrote:
> On Fri, Apr 22, 2005 at 12:41:56PM +0200, Christian Meder wrote:
> > -------
> > /<project>/blob/<blob-sha1>
> > /<project>/commit/<commit-sha1>
> 
> It is trivial to find an object when given a sha, but to know the object
> type you'd have to decompress it and check inside. Also the way git
> stores these things you can't have both a blob and a commit with the
> same sha anyways.
> 
> So why not use,
>     /<project>/<hexadecimal sha1 representation>
> 	will give you the raw object.
> 
>     /<project>/<hexadecimal sha1 representation>.html (.xml/.txt)
> 	will give you a parsed version for user presentation
> 
> And since hexadecimal numbers only have [0-9a-f] as valid characters,
> you can still have additional directories that can be guaranteed unique
> as long as the first two characters are not a valid hexadecimal value.
> So things like /branch/linus, or /changelog/, /log/, /diff/. Yeah, you
> can't use /delta/ without looking at more than the first two characters,
> but that's where dictionaries can come in handy.

Hmm. I'm not sure about throwing away the <objecttype> information in
the url. I think I'd prefer to retain the blob, tree and commit
namespaces because I think they help API users to explicitly state what
kind of object they expect. I can't think of a scenario where I'd want a
<sha1> of unknown type. Do you have a specific use case in mind ?
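
For comparison, the dispatch Jan describes could be sketched roughly as
below.  All three function names are hypothetical and nothing here is
taken from wit or gitweb; it only shows why a two-character test is
ambiguous for a name like "delta" while a full 40-character check is not:

```c
/* True for the characters Jan lists as valid: [0-9a-f]. */
static int is_lower_hex(char c)
{
	return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f');
}

/* Cheap test: do the first two characters look like a sha1 prefix? */
static int starts_with_hex_pair(const char *seg)
{
	return is_lower_hex(seg[0]) && is_lower_hex(seg[1]);
}

/* Full test: exactly 40 lowercase hex characters and nothing else. */
static int is_sha1_hex(const char *seg)
{
	int i;

	for (i = 0; i < 40; i++)
		if (!is_lower_hex(seg[i]))
			return 0;	/* also stops at an early '\0' */
	return seg[40] == '\0';
}
```

Here "branch", "log" and "diff" fail the two-character test, "delta"
passes it, and only a real object name passes the full check.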



				Christian     
-- 
Christian Meder, email: chris@absolutegiganten.org

The Way-Seeking Mind of a tenzo is actualized 
by rolling up your sleeves.

                (Eihei Dogen Zenji)


* Re: wit 0.0.3 - a web interface for git available
From: Greg KH @ 2005-04-22 21:09 UTC
  To: Christian Meder; +Cc: Kay Sievers, git
In-Reply-To: <1114202122.3207.4.camel@localhost>

On Fri, Apr 22, 2005 at 10:35:22PM +0200, Christian Meder wrote:
> On Thu, 2005-04-21 at 00:33 -0700, Greg KH wrote:
> > On Thu, Apr 21, 2005 at 03:28:27AM +0200, Kay Sievers wrote:
> > > On Wed, Apr 20, 2005 at 10:42:53AM +0100, Christoph Hellwig wrote:
> > > > On Tue, Apr 19, 2005 at 09:18:29PM -0700, Greg KH wrote:
> > > > > On Wed, Apr 20, 2005 at 02:29:11AM +0200, Christian Meder wrote:
> > > > > > Hi,
> > > > > > 
> > > > > > ok it's starting to look like spam ;-)
> > > > > > 
> > > > > > I uploaded a new version of wit to http://www.absolutegiganten.org/wit
> > > > > 
> > > > > Why not work together with Kay's tool:
> > > > > 	http://ehlo.org/~kay/gitweb.pl?project=linux-2.6&action=show_log
> > > > 
> > > > That one looks really nice.  One major feature I'd love to see would
> > > > be a show all diffs link for a changeset.
> > > 
> > > It's working now:
> > >   http://ehlo.org/~kay/gitweb.pl
> > > 
> > > Many thanks to Christian Gierke for all the interface work, the nice
> > > layout and the git logo. Thanks for the colored diff to Ken Brush.
> > 
> > Very nice, this looks great.  And hey, we have a git logo now :)
> 
> BTW is this logo already officially blessed ?

"blessed" how?
Have an alternative one?

thanks,

greg k-h

* Re: [FYI] Cogito rsync/download location moved
From: Mr. White @ 2005-04-22 21:07 UTC
  To: Petr Baudis; +Cc: git
In-Reply-To: <20050422180057.GF7173@pasky.ji.cz>

Petr Baudis wrote:

>  Hello,
>
>  I'm happy to announce that the Cogito rsync location changed to
>
>	rsync://rsync.kernel.org/pub/scm/cogito/cogito.git
>
>  Please update your .git/remotes accordingly.
>
>  Also please note that the Cogito download location changed too. From
>now on, Cogito releases will appear at
>
>	http://www.kernel.org/pub/software/scm/cogito
>
>or
>
>	ftp://ftp.kernel.org/pub/software/scm/cogito
>
>  Please update your bookmarks, if you have any.
>
>  This will hopefully make me fit to my bandwidth limit until the end of
>the month, and should make things significantly faster for you. Thanks a
>lot to the kernel.org folks who made it possible!
>
>  
>
I'd suggest you also update the rsync pointer at
http://kernel.org/pub/software/scm/cogito/README


* blowing chunks (quick update)
From: C. Scott Ananian @ 2005-04-22 21:02 UTC
  To: Git Mailing List; +Cc: Linus Torvalds
In-Reply-To: <Pine.LNX.4.58.0504201510520.6467@ppc970.osdl.org>

Just a quick status update on my chunking work: I've got working code (see 
attached, but it's very rough) and am tweaking the various knobs and dials 
to try to get reasonable space savings.  I can save a modest amount of 
space using a 64k chunk size, but it's surprisingly hard to get 
substantial reductions.  Not because there's not a lot of redundancy in 
the archive --- there is --- but because 'gzip -9' is damn good at 
compressing source code.  Splitting files up into chunks hampers the 
compression, causing more 'extra space' to be used than the chunking 
format by itself would seem to indicate. [The patch below pre-seeds zlib's 
dictionary for non-leaf chunks to help mitigate this.]

I'm not giving up yet: there are a number of tricks left I'd like to play.
This is just a quick update to let y'all know I'm still hard at work.
  --scott [I'll also be off-line until Sunday.]

Khaddafi tonight Shoal Bay D5 SLBM LCPANGS ODIBEX Cheney ammunition 
counter-intelligence Japan overthrow Chechnya Mossad explosion ZRBRIEF
                          ( http://cscott.net/ )

diff -ruHp -x .dircache -x .git -x '*.o' -x '*~' -x 'blow-chunks.?.*' git.repo.orig/Makefile git.repo/Makefile
--- git.repo.orig/Makefile	2005-04-21 15:45:39.000000000 -0400
+++ git.repo/Makefile	2005-04-21 17:46:10.000000000 -0400
@@ -8,6 +8,7 @@
  # break unless your underlying filesystem supports those sub-second times
  # (my ext3 doesn't).
  CFLAGS=-g -O3 -Wall
+CFLAGS=-g -Wall

  CC=gcc
  AR=ar
@@ -16,16 +17,17 @@ AR=ar
  PROG=   update-cache show-diff init-db write-tree read-tree commit-tree \
  	cat-file fsck-cache checkout-cache diff-tree rev-tree show-files \
  	check-files ls-tree merge-base merge-cache unpack-file git-export \
-	diff-cache convert-cache
+	diff-cache convert-cache \
+	chunktest blow-chunks chunk-size chunk-ref

  all: $(PROG)

  install: $(PROG)
  	install $(PROG) $(HOME)/bin/

-LIB_OBJS=read-cache.o sha1_file.o usage.o object.o commit.o tree.o blob.o
+LIB_OBJS=read-cache.o sha1_file.o usage.o object.o commit.o tree.o blob.o chunk.o
  LIB_FILE=libgit.a
-LIB_H=cache.h object.h
+LIB_H=cache.h object.h chunk.h

  $(LIB_FILE): $(LIB_OBJS)
  	$(AR) rcs $@ $(LIB_OBJS)
@@ -91,6 +93,16 @@ diff-cache: diff-cache.o $(LIB_FILE)
  convert-cache: convert-cache.o $(LIB_FILE)
  	$(CC) $(CFLAGS) -o convert-cache convert-cache.o $(LIBS)

+chunktest: chunktest.o $(LIB_FILE)
+	$(CC) $(CFLAGS) -o $@ $< $(LIBS)
+
+blow-chunks: blow-chunks.o $(LIB_FILE)
+	$(CC) $(CFLAGS) -o $@ $< $(LIBS)
+chunk-ref: chunk-ref.o $(LIB_FILE)
+	$(CC) $(CFLAGS) -o $@ $< $(LIBS)
+chunk-size: chunk-size.o $(LIB_FILE)
+	$(CC) $(CFLAGS) -o $@ $< $(LIBS)
+
  blob.o: $(LIB_H)
  cat-file.o: $(LIB_H)
  check-files.o: $(LIB_H)
diff -ruHp -x .dircache -x .git -x '*.o' -x '*~' -x 'blow-chunks.?.*' git.repo.orig/fsck-cache.c git.repo/fsck-cache.c
--- git.repo.orig/fsck-cache.c	2005-04-21 15:45:41.000000000 -0400
+++ git.repo/fsck-cache.c	2005-04-21 17:38:33.000000000 -0400
@@ -6,6 +6,7 @@
  #include "commit.h"
  #include "tree.h"
  #include "blob.h"
+#include "chunk.h"

  #define REACHABLE 0x0001

@@ -88,11 +89,16 @@ static int fsck_name(char *hex)
  			void *buffer = unpack_sha1_file(map, mapsize, type, &size);
  			if (!buffer)
  				return -1;
+			if (TYPE_IS_TREAP(type))
+			    /* xxx: we really should check chunk structure */
+			    buffer = chunk_read_sha1_file(sha1, type, &size, buffer);
  			if (check_sha1_signature(sha1, buffer, size, type) < 0)
  				printf("sha1 mismatch %s\n", sha1_to_hex(sha1));
  			munmap(map, mapsize);
-			if (!fsck_entry(sha1, type, buffer, size))
-				return 0;
+			if (fsck_entry(sha1, type, buffer, size) < 0)
+				return -1;
+			free(buffer);
+			return 0;
  		}
  	}
  	return -1;
diff -ruHp -x .dircache -x .git -x '*.o' -x '*~' -x 'blow-chunks.?.*' git.repo.orig/sha1_file.c git.repo/sha1_file.c
--- git.repo.orig/sha1_file.c	2005-04-21 15:45:41.000000000 -0400
+++ git.repo/sha1_file.c	2005-04-21 22:12:40.000000000 -0400
@@ -8,6 +8,7 @@
   */
  #include <stdarg.h>
  #include "cache.h"
+#include "chunk.h"

  const char *sha1_file_directory = NULL;

@@ -120,7 +121,7 @@ void * unpack_sha1_file(void *map, unsig
  {
  	int ret, bytes;
  	z_stream stream;
-	char buffer[8192];
+	char buffer[SMALL_FILE_LIMIT];
  	char *buf;

  	/* Get the data stream */
@@ -132,10 +133,15 @@ void * unpack_sha1_file(void *map, unsig

  	inflateInit(&stream);
  	ret = inflate(&stream, 0);
+	//if (ret==Z_NEED_DICT) ...
  	if (sscanf(buffer, "%10s %lu", type, size) != 2)
  		return NULL;
-
  	bytes = strlen(buffer) + 1;
+
+	/* Tricky optimization; avoid encoding size for small files. */
+	if (ret==Z_STREAM_END && *size==0)
+	    *size = stream.total_out - bytes;
+
  	buf = malloc(*size);
  	if (!buf)
  		return NULL;
@@ -161,6 +167,8 @@ void * read_sha1_file(const unsigned cha
  	if (map) {
  		buf = unpack_sha1_file(map, mapsize, type, size);
  		munmap(map, mapsize);
+		if (TYPE_IS_TREAP(type))
+		    return chunk_read_sha1_file(sha1, type, size, buf);
  		return buf;
  	}
  	return NULL;
@@ -215,6 +223,7 @@ int write_sha1_file(char *buf, unsigned

  	if (write(fd, compressed, size) != size)
  		die("unable to write file");
+	free(compressed);
  	close(fd);

  	return 0;
@@ -259,9 +268,11 @@ int write_sha1_buffer(const unsigned cha
  		if (collision_check(filename, buf, size))
  			return error("SHA1 collision detected!"
  					" This is bad, bad, BAD!\a\n");
+		errno = EEXIST; /* indicate to caller that this exists */
  		return 0;
  	}
  	write(fd, buf, size);
  	close(fd);
+	errno = 0;
  	return 0;
  }
diff -ruHp -x .dircache -x .git -x '*.o' -x '*~' -x 'blow-chunks.?.*' git.repo.orig/update-cache.c git.repo/update-cache.c
--- git.repo.orig/update-cache.c	2005-04-21 15:45:41.000000000 -0400
+++ git.repo/update-cache.c	2005-04-20 17:32:07.000000000 -0400
@@ -4,6 +4,7 @@
   * Copyright (C) Linus Torvalds, 2005
   */
  #include "cache.h"
+#include "chunk.h"

  /*
   * Default to not allowing changes to the list of files. The
@@ -23,6 +24,7 @@ static int index_fd(unsigned char *sha1,
  	void *metadata = malloc(200);
  	int metadata_size;
  	void *in;
+	int retval, err;
  	SHA_CTX c;

  	in = "";
@@ -51,6 +53,7 @@ static int index_fd(unsigned char *sha1,
  	stream.avail_out = max_out_bytes;
  	while (deflate(&stream, 0) == Z_OK)
  		/* nothing */;
+	free(metadata);

  	/*
  	 * File content
@@ -62,7 +65,11 @@ static int index_fd(unsigned char *sha1,

  	deflateEnd(&stream);

-	return write_sha1_buffer(sha1, out, stream.total_out);
+	retval = write_sha1_buffer(sha1, out, stream.total_out);
+	err = errno;
+	free(out);
+	errno = err;
+	return retval;
  }

  /*
@@ -113,7 +120,7 @@ static int add_file_to_cache(char *path)
  	ce->ce_mode = create_ce_mode(st.st_mode);
  	ce->ce_flags = htons(namelen);

-	if (index_fd(ce->sha1, fd, &st) < 0)
+	if (chunk_index_fd(ce->sha1, fd, &st) < 0)
  		return -1;

  	return add_cache_entry(ce, allow_add);
--- /dev/null	2005-04-20 20:21:45.319868048 -0400
+++ git.repo/blow-chunks.c	2005-04-20 21:33:33.000000000 -0400
@@ -0,0 +1,34 @@
+#include <stdlib.h>
+#include "cache.h"
+#include "chunk.h"
+
+/* For every file on the command-line, if it is a blob, convert it to a chunk.
+ */
+void convert_one(char *filename) {
+    char type[10];
+    int fd = open(filename, O_RDONLY);
+    struct stat st;
+    void *map, *buf;
+    unsigned long size;
+    if (fd < 0) { perror(filename); return; }
+    if (fstat(fd, &st) < 0) { perror(filename); close(fd); return; }
+    map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
+    close(fd);
+    if (map == MAP_FAILED) { perror("mmap failed"); return; }
+    buf = unpack_sha1_file(map, st.st_size, type, &size);
+    munmap(map, st.st_size);
+    if (buf == NULL) { perror("Couldn't open file"); return; }
+    if (strcmp(type, "blob")==0) {
+	unsigned char sha1[20];
+	/* a-ha! */
+	chunkify_blob(buf, size, sha1);
+    }
+    free(buf);
+}
+
+int main(int argc, char **argv) {
+    int i;
+    for (i=1; i<argc; i++)
+	convert_one(argv[i]);
+    return 0;
+}
--- /dev/null	2005-04-20 20:21:45.319868048 -0400
+++ git.repo/chunk-ref.c	2005-04-21 17:45:10.000000000 -0400
@@ -0,0 +1,44 @@
+#include <stdlib.h>
+#include "cache.h"
+#include "chunk.h"
+
+void ref_one(char *filename, char *find_parent) {
+    char type[10];
+    int fd = open(filename, O_RDONLY);
+    struct stat st;
+    void *map, *buf;
+    unsigned long size;
+    if (fd < 0) { perror(filename); return; }
+    if (fstat(fd, &st) < 0) { perror(filename); close(fd); return; }
+    map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
+    close(fd);
+    if (map == MAP_FAILED) { perror("mmap failed"); return; }
+    buf = unpack_sha1_file(map, st.st_size, type, &size);
+    munmap(map, st.st_size);
+    if (buf == NULL) { perror("Couldn't open file"); return; }
+    if (TYPE_IS_TREAP(type)) {
+	int num_children = 0, i;
+	if (TREAP_HAS_LEFT(type)) num_children++;
+	if (TREAP_HAS_RIGHT(type)) num_children++;
+	for (i=0; i<num_children; i++) {
+	    size -= 20;
+	    if (find_parent) {
+		if (strcmp(sha1_file_name(buf + size), find_parent)==0)
+		    printf("%s\n", filename);
+	    } else
+		printf("%s -> %s\n", filename, sha1_file_name(buf + size));
+	}
+    }
+    free(buf);
+}
+
+/* If you provide a filename followed by '--' on the command line, will print
+ * all of the given chunks which are parents of that chunk.   Else, print all
+ * children of the given chunks. */
+int main(int argc, char **argv) {
+    char *parent = (argc>2) && (strcmp("--", argv[2])==0) ? argv[1] : NULL;
+    int i = parent ? 3 : 1;
+    for (; i<argc; i++)
+	ref_one(argv[i], parent);
+    return 0;
+}
--- /dev/null	2005-04-20 20:21:45.319868048 -0400
+++ git.repo/chunk-size.c	2005-04-21 17:41:00.000000000 -0400
@@ -0,0 +1,34 @@
+#include <stdlib.h>
+#include "cache.h"
+#include "chunk.h"
+
+/* For every file on the command line, if it is a treap chunk, print its
+ * name and its content size (trailing child hashes excluded). */
+void size_one(char *filename) {
+    char type[10];
+    int fd = open(filename, O_RDONLY);
+    struct stat st;
+    void *map, *buf;
+    unsigned long size;
+    if (fd < 0) { perror(filename); return; }
+    if (fstat(fd, &st) < 0) { perror(filename); close(fd); return; }
+    map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
+    close(fd);
+    if (map == MAP_FAILED) { perror("mmap failed"); return; }
+    buf = unpack_sha1_file(map, st.st_size, type, &size);
+    munmap(map, st.st_size);
+    if (buf == NULL) { perror("Couldn't open file"); return; }
+    if (TYPE_IS_TREAP(type)) {
+	if (TREAP_HAS_LEFT(type)) size-=20;
+	if (TREAP_HAS_RIGHT(type)) size-=20;
+	printf("%s %lu\n", filename, size);
+    }
+    free(buf);
+}
+
+int main(int argc, char **argv) {
+    int i;
+    for (i=1; i<argc; i++)
+	size_one(argv[i]);
+    return 0;
+}
--- /dev/null	2005-04-20 20:21:45.319868048 -0400
+++ git.repo/chunk.c	2005-04-22 12:30:59.000000000 -0400
@@ -0,0 +1,640 @@
+/*
+ * This file implements a treap-based chunked content store.  The
+ * idea is that every stored file is broken down into tree-structured
+ * chunks (that is, every chunk has an optional 'prefix' and 'suffix'
+ * chunk), and these chunks are put in the object store.  This way
+ * similar files will be expected to share chunks, saving space.
+ * Files less than one disk block long are expected to fit in a single
+ * chunk, so there is no extra indirection overhead for this case.
+ *
+ * Copyright (C) 2005 C. Scott Ananian <cananian@alumni.princeton.edu>
+ */
+
+/*
+ * We assume that the file and the chunk information all fits in memory.
+ * A slightly more-clever implementation would work even if the file
+ * didn't fit.  Basically, we could scan it and keep the
+ * 'N' lowest heap keys (chunk hashes), where 'N' is chosen to fit
+ * comfortably in memory.  These would form the root and top
+ * of the resulting treap, constructing it top-down.  Then we'd scan
+ * again and only keep the next 'N' lowest heap keys, etc.
+ *
+ * But we're going to keep things simple.  We do try to maintain locality
+ * where possible, so if you need to swap, things still shouldn't be too bad.
+ */
+
+#include <assert.h>
+#include <stdlib.h>
+#include "cache.h"
+#include "chunk.h"
+
+typedef unsigned long ch_size_t;
+
+/* Our magic numbers: these can be tuned without breaking files already
+ * in the archive, although space re-use is only expected between files which
+ * have these constants set to the same values. */
+
+/* The window size determines how much context we use when looking for a
+ * chunk boundary.
+ * C source has approx 5 bits per character of entropy.
+ * We'd like to get 32 bits of good entropy into our boundary checksum;
+ * that means 7 bytes is a rough minimum for the window size.
+ * 30 bytes is what 'rsyncable zlib' uses; that should be fine. */
+#define ROLLING_WINDOW 67
+/* The ideal chunk size will fit most chunks into a disk block.  A typical
+ * disk block size is 4k, and we expect (say) 50% compression. */
+/* some primes: 61 127 251 509 1021 2039 4091 8191 16381 32749 65521 */
+//#define CHUNK_SIZE 7901 /* primes are nice to use */
+#define CHUNK_SIZE 16381
+
+#define WINDOW_MAGIC 0x0000 /* aka, never */
+
+/* Data structures: */
+struct chunk {
+    /* a chunk represents some range of the underlying file */
+    ch_size_t start /* inclusive */, end /*exclusive*/;
+    unsigned char sha1[20]; /* sha1 for this chunk; used as the heap key */
+};
+struct chunklist {
+    /* a dynamically-sized list of chunks */
+    struct chunk *chunk; /* an array of chunks */
+    ch_size_t num_items; /* how many items are currently in the list */
+    ch_size_t allocd;    /* how many items we've allocated space for */
+};
+struct treap {
+    /* A treap node represents a run of consecutive chunks. */
+
+    /* the start and end of the run: */
+    ch_size_t start /* inclusive */, end /*exclusive*/;
+    struct chunk *chunk; /* some chunk in the run. */
+    /* treaps representing the run before 'chunk' (left) and
+     * after 'chunk' (right).  */
+    struct treap *left, *right;
+    /* sha1 for the run represented by this treap */
+    unsigned char sha1[20];
+};
+
+static struct chunklist *
+create_chunklist(int expected_items) {
+    struct chunklist *cl = malloc(sizeof(*cl));
+    assert(expected_items > 0);
+    cl->num_items = 0;
+    cl->allocd = expected_items;
+    cl->chunk = malloc(sizeof(cl->chunk[0]) * cl->allocd);
+    return cl;
+}
+static void
+free_chunklist(struct chunklist *cl) {
+    free(cl->chunk);
+    free(cl);
+}
+
+/* Add a chunk to the chunk list, calculating its SHA1 in the process. */
+/* The chunk includes buf[start] to buf[end-1].                        */
+static void
+add_chunk(struct chunklist *cl, char *buf, ch_size_t start, ch_size_t end) {
+    struct chunk *ch;
+    SHA_CTX c;
+    assert(start<end); assert(cl); assert(buf);
+    if (cl->num_items >= cl->allocd) {
+	cl->allocd *= 2;
+	cl->chunk = realloc(cl->chunk, cl->allocd * sizeof(*(cl->chunk)));
+    }
+    assert(cl->num_items < cl->allocd);
+    ch = cl->chunk + (cl->num_items++);
+    ch->start = start;
+    ch->end = end;
+    /* compute SHA-1 of the chunk. */
+    SHA1_Init(&c);
+    SHA1_Update(&c, buf+start, end-start);
+    SHA1_Final(ch->sha1, &c);
+    /* done! */
+}
+
+/* Split a buffer into chunks, using an adler-32 checksum over ROLLING_WINDOW
+ * bytes to determine chunk boundaries.  We try to split chunks into pieces
+ * whose size averages out to be 'CHUNK_SIZE' (nice if this is a prime).*/
+static void
+chunkify(struct chunklist *cl, char *buf, ch_size_t size) {
+    int i, adler_s1=1, adler_s2=0, last=-1;
+
+    for (i=0; i<size; i++) {
+	if (i >= ROLLING_WINDOW) { /* After window is full: */
+	    /* Old character out */
+	    adler_s1 = (65521 + adler_s1 - (unsigned char)buf[i-ROLLING_WINDOW]) % 65521;
+	    adler_s2 = (65521 + adler_s2 - ROLLING_WINDOW * (unsigned char)buf[i-ROLLING_WINDOW]) % 65521;
+	}
+	/* New character in */
+	adler_s1 = (adler_s1 + (unsigned char)buf[i]) % 65521;
+	adler_s2 = (adler_s2 + adler_s1) % 65521;
+	/* Is this the end of a chunk? */
+	if (WINDOW_MAGIC == ((adler_s1 + adler_s2*65536) % CHUNK_SIZE)) {
+	    add_chunk(cl, buf, last+1, i+1);
+	    last = i;
+	    //adler_s1 = 1; adler_s2 = 0; /* reset window */
+	}
+    }
+    /* One last chunk at the end: */
+    if (last+1!=size)
+	add_chunk(cl, buf, last+1, size);
+    /* done! */
+}
+
+/* A treap is a 'heap-ordered tree'.  There are two constraints maintained:
+ *   left tree key < this tree key < right tree key
+ * and
+ *   this heap key < left and right heap keys.
+ * We use the sha1 of the chunk (chunk->sha1) as the heap key and the
+ * file location (chunk->start) as the tree key.
+ * For more info on treaps, see:
+ *   C. R. Aragon and R. G. Seidel, "Randomized search trees",
+ *   Proc. 30th IEEE FOCS (1989), 540-545.
+ * There are many possible binary trees we could build; enforcing the
+ * heap constraint ensures that similar files will build similar trees.
+ * (The root of the constructed tree will always be the chunk with the
+ *  smallest hash key; its left child will be the chunk with the smallest
+ *  hash among those chunks before the root in file order; and so on
+ *  recursively.)
+ */
+
+/* Compare the 'heap keys' of two chunks. */
+static int
+chunk_hash_cmp(struct chunk *c1, struct chunk *c2) {
+    int c = memcmp(c1->sha1, c2->sha1, sizeof(c1->sha1));
+    if (c!=0) return c;
+    /* Use file location to break ties (caused by repeated content w/in
+     * a single file).  This ensures that our heap keys are unique. */
+    return c1->start - c2->start;
+}
+
+/* Assertion helper: check tree and heap constraints. */
+static int
+treap_valid(struct treap *t) {
+    if (!t) return 1;
+    if (t->chunk==NULL) return 0;
+    if (t->left!=NULL) {
+	/* Tree constraint. */
+	assert(t->left->chunk->start < t->chunk->start);
+	/* Heap constraint. */
+	assert(chunk_hash_cmp(t->chunk, t->left->chunk) < 0);
+	/* 'start' validity */
+	assert(t->start == t->left->start);
+    } else
+	assert(t->start == t->chunk->start);
+    if (t->right!=NULL) {
+	/* Tree constraint. */
+	assert(t->chunk->start < t->right->chunk->start);
+	/* Heap constraint. */
+	assert(chunk_hash_cmp(t->chunk, t->right->chunk) < 0);
+	/* 'end' validity. */
+	assert(t->end == t->right->end);
+    } else
+	assert(t->end == t->chunk->end);
+    return 1;
+}
+
+/* Restore heap constraint without disturbing tree ordering. */
+/* Only the root of the given treap will violate the heap constraint. */
+static struct treap *
+treapify(struct treap *t) {
+    struct treap *x, *y, *a, *b, *c;
+    int left_ok, right_ok, rotate_left;
+    assert(treap_valid(t->left));
+    assert(treap_valid(t->right));
+    left_ok = (t->left == NULL) ||
+	(chunk_hash_cmp(t->chunk, t->left->chunk) < 0);
+    right_ok = (t->right == NULL) ||
+	(chunk_hash_cmp(t->chunk, t->right->chunk) < 0);
+    if (left_ok && right_ok) { /* well, that's easy */
+	assert(treap_valid(t));
+	return t;
+    }
+    /* okay, someone needs to rotate */
+    rotate_left = (!left_ok) &&
+	(right_ok || /* if neither is okay, then rotate smallest up */
+	 chunk_hash_cmp(t->left->chunk, t->right->chunk) < 0);
+    /*   Rotation:
+     *     y   -bring left up->  x 
+     *    / \                   / \
+     *   x   c                 a   y
+     *  / \                       / \
+     * a   b <-bring right up-   b   c
+     */
+    if (rotate_left) {
+	y = t;  x = y->left;  c = y->right;  a = x->left;  b = x->right;
+	y->left = b;
+	y->right = c;
+	y->start = y->left ? y->left->start : y->chunk->start;
+	y->end = y->right ? y->right->end : y->chunk->end;
+	x->left = a;
+	x->right = treapify(y); // recurse to check heap constraint
+	x->start = x->left ? x->left->start : x->chunk->start;
+	x->end = x->right ? x->right->end : x->chunk->end;
+	assert(treap_valid(x));
+	return x;
+    } else {
+	x = t;  a = x->left;  y = x->right;  b = y->left;  c = y->right;
+	x->left = a;
+	x->right = b;
+	x->start = x->left ? x->left->start : x->chunk->start;
+	x->end = x->right ? x->right->end : x->chunk->end;
+	y->right = c;
+	y->left = treapify(x); // recurse to check heap constraint.
+	y->start = y->left ? y->left->start : y->chunk->start;
+	y->end = y->right ? y->right->end : y->chunk->end;
+	assert(treap_valid(y));
+	return y;
+    }
+}
+
+/* Use list of chunks to build treap bottom-up, calling treapify to
+ * restore heap order on the subtree after we add each interior node.
+ * This is O(N), where N is the number of chunks. */
+static struct treap *
+build_treap(struct chunklist *cl, int chunk_st, int chunk_end) {
+    struct treap *result;
+    /* Some treaps are trivial to build: */
+    if (chunk_st >= chunk_end) return NULL;
+    /* Claim a chunk in the middle for ourself. */
+    int c = (chunk_st + chunk_end)/2;
+    result = (struct treap *)malloc(sizeof(*result));
+    result->chunk = &(cl->chunk[c]);
+    /* Divide and conquer: build well-formed treaps for our kids.*/
+    result->left = build_treap(cl, chunk_st, c);
+    result->right = build_treap(cl, c+1, chunk_end);
+    result->start = result->left ? result->left->start : result->chunk->start;
+    result->end = result->right ? result->right->end : result->chunk->end;
+    /* Now we need to ensure that the heap constraint is satisfied; that is,
+     * result->chunk->sha1 < result->left->chunk->sha1  and
+     * result->chunk->sha1 < result->right->chunk->sha1.
+     */
+    assert(treap_valid(result->left));
+    assert(treap_valid(result->right));
+    return treapify(result);
+}
+
+static void
+free_treap(struct treap *t) {
+    if (!t) return;
+    free_treap(t->left);
+    free_treap(t->right);
+    free(t);
+}
+
+static int
+treap_depth(struct treap *t) {
+    int l, r;
+    if (!t) return 0;
+    l = treap_depth(t->left);
+    r = treap_depth(t->right);
+    return 1 + ((l > r) ? l : r);
+}
+
+/* Fill in the treap hashes.  This will be O(N ln M), where N is the
+ * file length and M is the number of chunks.  We could actually do
+ * this in 2*N time if the subtree hashes were prefix-identical.
+ * Since we need to include the chunk length in the hash prefix,
+ * we can't reuse the hashing context and we need to pay the extra
+ * O(ln M) factor. */
+static void
+do_treap_hash(struct treap *t, void *data, SHA_CTX *accum, int accum_len) {
+    char prefix[200];
+    SHA_CTX *cp;
+    int i;
+
+    assert(treap_valid(t));
+    if (!t) return;
+
+    /* Start a new treap context. */
+    cp = &(accum[accum_len++]);
+    SHA1_Init(cp);
+    /* Sticking the size in the prefix makes me unhappy. =( */
+    SHA1_Update(cp, prefix, 1+sprintf(prefix, "blob %lu", t->end - t->start));
+    /* Recurse on the left. */
+    do_treap_hash(t->left, data, accum, accum_len);
+    /* Add in our chunk. */
+    for (i=0; i<accum_len; i++)
+	SHA1_Update(accum + i, data + t->chunk->start,
+		    t->chunk->end - t->chunk->start);
+    /* Recurse on the right. */
+    do_treap_hash(t->right, data, accum, accum_len);
+    /* Finalize and write it to t->sha1. */
+    SHA1_Final(t->sha1, cp);
+    /* Done! */
+}
+/* Helper method. */
+static void
+compute_treap_hashes(struct treap *t, void *data) {
+    /* Allocate space for each level of the treap to have its own context. */
+    SHA_CTX contexts[treap_depth(t)];
+    do_treap_hash(t, data, contexts, 0);
+}
+/* Yuck. */
+static const char *
+compute_null_treap_hash() {
+    static const char fixed[] = { "blob 0" };
+    static char sha1[20], *cp=NULL;
+    SHA_CTX c;
+    if (cp) return cp;
+    SHA1_Init(&c);
+    SHA1_Update(&c, fixed, sizeof(fixed));
+    SHA1_Final(sha1, &c);
+    cp = sha1;
+    return cp;
+}
+
+
+/* Now that we've broken it down into treap-structured pieces, let's write
+ * them to the object store. */
+
+/* Write a single treap piece to the object store.  Note that 't' may be
+ * NULL for the special case of a zero-byte file.  Writes the hash of
+ * this piece back to 'sha1', which must be non-NULL. Returns 0 on success.*/
+static int
+write_one(struct treap *t, char *buf) {
+/* two hundred bytes is two 20-byte SHA1 hashes, two presence bytes,
+ * six bytes of type, one null, plus a file length up to 10^151. (Conservative.) */
+#define MAX_METADATA_LEN 200
+    z_stream stream;
+    ch_size_t max_out_bytes;
+    ch_size_t chunk_size = t ? (t->chunk->end - t->chunk->start) : 0;
+    ch_size_t content_size, metadata_size;
+    char metadata[MAX_METADATA_LEN];
+    void *out;
+ 
+    /* Calculate size, create type tag. */
+    content_size = chunk_size;
+    if (t && t->left) content_size += sizeof(t->left->sha1);
+    if (t && t->right) content_size += sizeof(t->right->sha1);
+    metadata_size = 1+sprintf
+	(metadata, "%c %lu", TREAP_TAG(t&&t->left,t&&t->right),
+	 /* optimize saving small files by skipping the 'length' field. */
+	 ((content_size + MAX_METADATA_LEN) < SMALL_FILE_LIMIT) ? 0 :
+	 content_size);
+ 
+    memset(&stream, 0, sizeof(stream));
+    deflateInit(&stream, Z_BEST_COMPRESSION);
+    max_out_bytes = deflateBound(&stream, content_size+metadata_size);
+    out = malloc(max_out_bytes);
+    stream.next_out = out;
+    stream.avail_out = max_out_bytes;
+
+    /* Use left subtree as dictionary to improve compression. */
+    if (t && t->start < t->chunk->start)
+	deflateSetDictionary
+	    (&stream, buf + t->start, t->chunk->start - t->start);
+
+    /* 
+     * Metadata: Type, ASCII size, null byte.
+     */
+    stream.next_in = metadata;
+    stream.avail_in = metadata_size;
+    while (deflate(&stream, 0) == Z_OK)
+	    /* nothing */;
+
+    /*
+     * Chunk content.
+     */
+    stream.next_in = buf + ( t ? t->chunk->start : 0);
+    stream.avail_in = chunk_size; /* possibly zero */
+    while (deflate(&stream, 0) == Z_OK)
+	/* nothing */;
+
+    /*
+     * Append uncompressed hashes to the end.
+     */
+    if (t && (t->left || t->right))
+	/* This is random data; it just expands if you try to compress it. */
+	deflateParams(&stream, Z_NO_COMPRESSION, Z_DEFAULT_STRATEGY);
+    if (t && t->left) { /* left hash */
+	stream.next_in = t->left->sha1;
+	stream.avail_in = sizeof(t->left->sha1);
+	while (deflate(&stream, 0) == Z_OK)
+	    /* nothing */;
+    }
+    if (t && t->right) { /* right hash */
+	stream.next_in = t->right->sha1;
+	stream.avail_in = sizeof(t->right->sha1);
+	while (deflate(&stream, 0) == Z_OK)
+	    /* nothing */;
+    }
+    /* Okay, finish up. */
+    stream.next_in = "";
+    stream.avail_in = 0;
+    while (deflate(&stream, Z_FINISH) == Z_OK)
+	/* nothing */;
+    deflateEnd(&stream);
+
+    return write_sha1_buffer(t ? (const char*) t->sha1 :
+			     compute_null_treap_hash(),
+			     out, stream.total_out);
+}
+
+/* Write all treap nodes to disk. */
+/* Return rightmost chunk in 'dict' if non-null. */
+static int
+write_treap(struct treap *t, char *buf, char *sha1_ret) {
+    const char *sha1 = t ? (const char*)t->sha1 : compute_null_treap_hash();
+    /* Provide sha1 to parent, if asked for. */
+    if (sha1_ret) memcpy(sha1_ret, sha1, sizeof(t->sha1));
+    /* Write us. */
+    if (write_one(t, buf) < 0)
+	return -1; /* failure. */
+    /* We don't need to write children if this already existed. */
+    if (errno == EEXIST) return 0;
+    /* No such luck.  Write our children. */
+    if (t && t->left)
+	if (write_treap(t->left, buf, NULL) < 0)
+	    return -1; /* failure. */
+    if (t && t->right)
+	if (write_treap(t->right, buf, NULL) < 0)
+	    return -1; /* failure. */
+    /* Now write us.  Note t may == NULL for a zero-byte file. */
+    /* Write back sha1, if wanted. */
+    errno = 0;
+    return 0;
+}
+
+static int
+chunky_write_buffer(unsigned char *sha1, void *buffer, unsigned long size,
+		    int force_write) {
+    struct chunklist *cl;
+    struct treap *t;
+    int st = 0;
+    /* We expect there to be 'file length / CHUNK_SIZE' chunks.  Over-estimate
+     * a little, and do the initial chunk list allocation. */
+    cl = create_chunklist(1 + ((3 * size) / (2 * CHUNK_SIZE)));
+    /* Split the file into chunks. */
+    chunkify(cl, buffer, size);
+    /* Build the treap. */
+    t = build_treap(cl, 0, cl->num_items);
+    assert(treap_valid(t));
+    /* Compute all the hashes. */
+    compute_treap_hashes(t, buffer);
+    /* Now write all the pieces, updating SHA1 for this file in the process. */
+    st = write_treap(t, buffer, sha1);
+    if (force_write && st==0 && errno == EEXIST)
+	if (unlink(sha1_file_name(sha1))==0)
+	    st = write_treap(t, buffer, sha1);
+    /* Free everything; we're done. */
+    free_treap(t);
+    free_chunklist(cl);
+    return st;
+}
+
+/* EXPORTED FUNCTION: write the file open on file descriptor 'fd'
+ * and described by 'ce' and 'st' to the object store.   Return
+ * 0 on success, -1 on failure. */
+/* This does the same thing as 'index_fd' in Linus' update-cache.c */
+int
+chunk_index_fd(unsigned char *sha1, int fd, struct stat *st) {
+    char *in; int rc;
+
+    in = "";
+    if (st->st_size)
+	in = mmap(NULL, st->st_size, PROT_READ, MAP_PRIVATE, fd, 0);
+    close(fd);
+    if (in==MAP_FAILED) return -1;
+
+    rc = chunky_write_buffer(sha1, in, st->st_size, 0/* don't force write*/);
+
+    if (st->st_size)
+	munmap(in, st->st_size);
+
+    return rc;
+}
+
+/*** A similar function: this just chunkifies an existing blob. */
+void
+chunkify_blob(void *buffer, unsigned long size, unsigned char *sha1) {
+    unsigned char t[20];
+    chunky_write_buffer(sha1?sha1:t, buffer, size, 1/*force write*/);
+}
+
+
+/*** Functions to read a chunked file into a contiguous buffer. ***/
+
+struct read_chunk {
+    void *chunk_data;
+    ch_size_t chunk_size, total_size;
+    struct read_chunk *left, *right;
+};
+static struct read_chunk *
+read_chunk(const unsigned char *sha1);
+
+static struct read_chunk *
+read_chunk_chunk(const unsigned char *sha1, const char *type, ch_size_t size, void *data) {
+    struct read_chunk *result = malloc(sizeof(*result));
+ 
+    /* Parse the chunk data. */
+    result->left = result->right = NULL;
+    result->chunk_data = data;
+    result->chunk_size = size;
+    if (TREAP_HAS_LEFT(type)) {
+	result->chunk_size -= 20;
+	result->left = read_chunk(result->chunk_data + result->chunk_size);
+	if (!result->left) return NULL; /* error! */
+    }
+    if (TREAP_HAS_RIGHT(type)) {
+	result->chunk_size -= 20;
+	result->right = read_chunk(result->chunk_data + result->chunk_size);
+	if (!result->right) return NULL; /* error! */
+    }
+    result->total_size = result->chunk_size +
+	(result->left ? result->left->total_size : 0) +
+	(result->right ? result->right->total_size : 0);
+    return result;
+}
+static struct read_chunk *
+read_chunk_blob(const unsigned char *sha1, void *data, ch_size_t size) {
+    struct read_chunk *result = malloc(sizeof(*result));
+    result->chunk_data = data;
+    result->chunk_size = result->total_size = size;
+    result->left = result->right = NULL;
+    return result;
+}
+static struct read_chunk *
+read_chunk(const unsigned char *sha1) {
+    void *data;
+    ch_size_t size;
+    char type[10];
+    /* This used to be:
+     * data = read_sha1_file(sha1, type, &size);
+     * But we hacked read_sha1_file to transparently decompress chunks.
+     * So now we need to duplicate a little bit of code. */
+    void *map; unsigned long mapsize;
+    map = map_sha1_file(sha1, &mapsize);
+    if (!map) return NULL;
+    data = unpack_sha1_file(map, mapsize, type, &size);
+    munmap(map, mapsize);
+    /* End duplicate code. */
+    if (!data) return NULL;
+    /* Leaves may be blobs. */
+    if (strcmp(type, "blob")==0)
+	return read_chunk_blob(sha1, data, size);
+    assert(TYPE_IS_TREAP(type));
+    return read_chunk_chunk(sha1, type, size, data); 
+}
+static void
+copy_read_chunk(void *dest, struct read_chunk *rc) {
+    if (rc->left) {
+	copy_read_chunk(dest, rc->left);
+	dest += rc->left->total_size;
+    }
+    memcpy(dest, rc->chunk_data, rc->chunk_size);
+    if (rc->right)
+	copy_read_chunk(dest + rc->chunk_size, rc->right);
+}
+static void
+free_read_chunk(struct read_chunk *rc) {
+    if (rc->left) free_read_chunk(rc->left);
+    if (rc->right) free_read_chunk(rc->right);
+    free(rc->chunk_data);
+    free(rc);
+}
+
+/* This is called from 'read_sha1_file' in sha1_file.c as a
+ * 'post-processor' when a 'chunk' type file is found.  It will
+ * transparently stitch together the appropriate prefix and suffix
+ * chunks and pass the result off as a 'blob'. */
+void *
+chunk_read_sha1_file(const unsigned char *sha1, char *type, unsigned long *size, void *chunkdata) {
+    struct read_chunk *rc;
+    void *result;
+    assert(TYPE_IS_TREAP(type));
+    /* This is a 'chunk' object; get the rest of the pieces. */
+    rc = read_chunk_chunk(sha1, type, *size, chunkdata);
+    if (!rc) return NULL; /* error! */
+    /* Now concatenate them together. */
+    strcpy(type, "blob");
+    *size = rc->total_size;
+    result = malloc(*size);
+    copy_read_chunk(result, rc);
+    /* done! */
+    free_read_chunk(rc);
+    return result;
+}
+
+#if 0
+/* Exercise this code. */
+int main(int argc, char **argv) {
+    struct cache_entry ce;
+    struct stat st;
+    char *buf, type[10];
+    unsigned long size;
+    int fd;
+    fd = open(argv[1], O_RDONLY);
+    if (fd < 0) exit(1);
+    if (fstat(fd, &st) < 0) exit(1);
+    if (chunk_index_fd(ce.sha1, fd, &st) < 0) exit(1);
+    printf("Wrote file %s.\n", sha1_to_hex(ce.sha1));
+    /* seemed to work! */
+    buf = read_sha1_file(ce.sha1, type, &size);
+    if (!buf) exit(1);
+    printf("Read file %s, of type %s (%lu bytes):\n",
+	   sha1_to_hex(ce.sha1), type, size);
+    fwrite(buf, size, 1, stdout);
+    /* done! */
+    return 0;
+}
+#endif
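As a reading aid for chunkify() in chunk.c above, here is a minimal standalone sketch of content-defined chunking. It is not the code from the patch: it keeps a plain sum over the window instead of the Adler-32 style pair, and WINDOW, DIVISOR and find_boundaries are names invented for this illustration. What it shares with chunkify() is that a boundary depends only on the last few bytes seen, so the same content tends to produce the same cut points wherever it sits in a file.

```c
#include <assert.h>
#include <stddef.h>

#define WINDOW  4   /* hypothetical window size (the patch uses ROLLING_WINDOW) */
#define DIVISOR 7   /* hypothetical target average chunk size (cf. CHUNK_SIZE) */

/* Store the end offset (exclusive) of every chunk in 'ends' and return the
 * number of chunks found.  A boundary is declared after byte i whenever
 * the sum of the last WINDOW bytes is divisible by DIVISOR. */
static size_t
find_boundaries(const unsigned char *buf, size_t size,
                size_t *ends, size_t max_ends)
{
    unsigned long sum = 0;
    size_t i, n = 0, last = 0;

    assert(buf && ends);
    for (i = 0; i < size; i++) {
        if (i >= WINDOW)
            sum -= buf[i - WINDOW];   /* old byte leaves the window */
        sum += buf[i];                /* new byte enters the window */
        if (sum % DIVISOR == 0 && n < max_ends) {
            ends[n++] = i + 1;
            last = i + 1;
        }
    }
    if (last != size && n < max_ends) /* trailing partial chunk */
        ends[n++] = size;
    return n;
}
```

Each chunk runs from the previous end offset (or 0) up to ends[k], which matches the (start inclusive, end exclusive) convention of struct chunk.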
--- /dev/null	2005-04-20 20:21:45.319868048 -0400
+++ git.repo/chunk.h	2005-04-21 22:11:53.000000000 -0400
@@ -0,0 +1,27 @@
+#ifndef CHUNK_H
+#define CHUNK_H
+
+#define TYPE_IS_TREAP(type) \
+   ({char*_ty=(type); _ty[0] >= '0' && _ty[0] <= '3' && !_ty[1];})
+
+#define TREAP_TAG(has_left,has_right) \
+   ('0' + ((has_left)?2:0) + ((has_right)?1:0))
+#define TREAP_HAS_LEFT(tag) ((tag)[0]&2)
+#define TREAP_HAS_RIGHT(tag) ((tag)[0]&1)
+
+extern int
+chunk_index_fd(unsigned char *sha1, int fd, struct stat *st);
+
+void *
+chunk_read_sha1_file(const unsigned char *sha1, char *type,
+		     unsigned long *size, void *result);
+
+void
+chunkify_blob(void *buffer, unsigned long size, unsigned char *sha1);
+
+/* Avoid encoding the file length explicitly for files smaller than this.
+ * Should always be large enough to hold all the file metadata (type, length
+ * in ASCII, and a null byte) at least. */
+#define SMALL_FILE_LIMIT 16384
+
+#endif /* CHUNK_H */
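The type-tag macros in chunk.h above pack "has left child" and "has right child" into a single character from '0' to '3', and TYPE_IS_TREAP relies on a GNU statement expression. A rough equivalent in plain functions might look like the sketch below. The function names are made up for this illustration, and like the macros it assumes an ASCII-compatible character set, where the two low bits of '0' are clear.

```c
#include <assert.h>

/* Build the one-character type tag: '0'..'3'.
 * Bit 1 means a left child is present, bit 0 a right child. */
static char
treap_tag(int has_left, int has_right)
{
    return (char)('0' + (has_left ? 2 : 0) + (has_right ? 1 : 0));
}

/* A type string names a treap chunk if it is exactly one char in '0'..'3'. */
static int
type_is_treap(const char *type)
{
    assert(type);
    return type[0] >= '0' && type[0] <= '3' && type[1] == '\0';
}

static int tag_has_left(const char *type)  { return (type[0] & 2) != 0; }
static int tag_has_right(const char *type) { return (type[0] & 1) != 0; }
```

Because "blob" is four characters long, the length check in type_is_treap() keeps ordinary blobs from being mistaken for treap chunks.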
--- /dev/null	2005-04-20 20:21:45.319868048 -0400
+++ git.repo/chunktest.c	2005-04-20 14:49:46.000000000 -0400
@@ -0,0 +1,25 @@
+#include <stdlib.h>
+#include "cache.h"
+#include "chunk.h"
+
+/* Exercise this code. */
+int main(int argc, char **argv) {
+    struct cache_entry ce;
+    struct stat st;
+    char *buf, type[10];
+    unsigned long size;
+    int fd;
+    fd = open(argv[1], O_RDONLY);
+    if (fd < 0) exit(1);
+    if (fstat(fd, &st) < 0) exit(1);
+    if (chunk_index_fd(ce.sha1, fd, &st) < 0) exit(1);
+    printf("Wrote file %s.\n", sha1_to_hex(ce.sha1));
+    /* seemed to work! */
+    buf = read_sha1_file(ce.sha1, type, &size);
+    if (!buf) exit(1);
+    printf("Read file %s, of type %s (%lu bytes):\n",
+	   sha1_to_hex(ce.sha1), type, size);
+    fwrite(buf, size, 1, stdout);
+    /* done! */
+    return 0;
+}
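The treap comment in chunk.c says the root is always the chunk with the smallest hash, its left child is the smallest among the chunks before it, and so on. A small sketch of that rule, using plain unsigned integers in place of SHA1 heap keys, is below. This is a top-down illustration only, with hypothetical names run_root and run_depth; the patch builds the same shape bottom-up with build_treap() and treapify(), and this version can take quadratic time.

```c
#include <assert.h>
#include <stddef.h>

/* Index of the smallest key in keys[lo..hi-1]; this is the treap root for
 * that run of chunks.  Ties go to the earlier position, mirroring the
 * tie-break on file location in chunk_hash_cmp(). */
static size_t
run_root(const unsigned *keys, size_t lo, size_t hi)
{
    size_t i, best = lo;
    assert(lo < hi);
    for (i = lo + 1; i < hi; i++)
        if (keys[i] < keys[best])
            best = i;
    return best;
}

/* Depth of the treap over keys[lo..hi-1], built top-down: the root splits
 * the run into a left part (before it) and a right part (after it). */
static size_t
run_depth(const unsigned *keys, size_t lo, size_t hi)
{
    size_t root, l, r;
    if (lo >= hi)
        return 0;
    root = run_root(keys, lo, hi);
    l = run_depth(keys, lo, root);
    r = run_depth(keys, root + 1, hi);
    return 1 + (l > r ? l : r);
}
```

Since the shape depends only on the keys and their order, two files that share a run of chunks should end up sharing the subtrees built over that run, which is where the space saving is expected to come from.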

^ permalink raw reply

* Re: [patch] fixup GECOS handling
From: Kyle Hayes @ 2005-04-22 20:46 UTC (permalink / raw)
  To: azarah; +Cc: Petr Baudis, GIT Mailing Lists
In-Reply-To: <1114196803.29271.52.camel@nosferatu.lan>

On Fri, 2005-04-22 at 21:06 +0200, Martin Schlemmer wrote:
> Right, but ';' is not cutoff on linux for one, and from what you said
> freebsd as well.  How about this rather (note that I assumed that the
> use of ';' as delimiter will be in the minority, but we can switch
> things around if it turns out the other way):

I'm not sure that __aix__ is defined, but it is close enough.  Someone
with an AIX compiler can correct it if needed.  Anyone know about HP-UX
and Tru64 and all those other ones?

Note that the original code also cuts on '.'.  Is that used by some *nix
in GECOS?

Best,
Kyle

> ----
> (not signed off, etc, as just for comments)
> 
> Index: commit-tree.c
> ===================================================================
> --- 5f61aecb06c2f2579bbb5951b1b53e0dedc434eb/commit-tree.c  (mode:100644 sha1:c0b07f89286c3f6cceae8122b4c3142c8efaf8e1)
> +++ uncommitted/commit-tree.c  (mode:100644)
> @@ -96,21 +96,6 @@
>                 if (!c)
>                         break;
>         }
> -
> -       /*
> -        * Go back, and remove crud from the end: some people
> -        * have commas etc in their gecos field
> -        */
> -       dst--;
> -       while (--dst >= p) {
> -               unsigned char c = *dst;
> -               switch (c) {
> -               case ',': case ';': case '.':
> -                       *dst = 0;
> -                       continue;
> -               }
> -               break;
> -       }
>  }
> 
>  static const char *month_names[] = {
> @@ -311,6 +296,17 @@
>         if (!pw)
>                 die("You don't exist. Go away!");
>         realgecos = pw->pw_gecos;
> +       /*
> +        * The GECOS fields are separated via ',' on Linux, FreeBSD, etc,
> +        * and ';' on AIX.
> +        */
> +#if defined(__aix__)
> +       if (strchr(realgecos, ';'))
> +               *strchr(realgecos, ';') = 0;
> +#else
> +       if (strchr(realgecos, ','))
> +               *strchr(realgecos, ',') = 0;
> +#endif
>         len = strlen(pw->pw_name);
>         memcpy(realemail, pw->pw_name, len);
>         realemail[len] = '@';
> 
> 
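For comparison, the truncation in the quoted patch could be factored into one helper that calls strchr() once per string. This is only a sketch: trim_gecos is a hypothetical name and not something in commit-tree.c, and the choice of delimiter is left to the caller rather than to an #if.

```c
#include <assert.h>
#include <string.h>

/* Truncate 'gecos' in place at the first occurrence of 'delim'.
 * Strings without the delimiter are left untouched. */
static void
trim_gecos(char *gecos, char delim)
{
    char *p;
    assert(gecos);
    p = strchr(gecos, delim);
    if (p)
        *p = '\0';
}
```

Like the quoted patch, this edits the string in place. Calling it on a private copy of pw->pw_gecos would avoid writing into the buffer returned by getpwuid().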
-- 
Kyle Hayes <kyle@marchex.com>
Marchex Inc.


^ permalink raw reply

* Re: [PATCH] More docs
From: Junio C Hamano @ 2005-04-22 20:45 UTC (permalink / raw)
  To: David Greaves; +Cc: git
In-Reply-To: <4269557C.1050606@dgreaves.com>

>>>>> "DG" == David Greaves <david@dgreaves.com> writes:

DG> Removed Cogito stuff

DG>  commit-id

This is Cogito invention, not in the core.  Neither is tree-id.

DG>  ################################################################
DG> +diff-cache
DG> ...
DG> +-z
DG> +	/0 line termination on output

Write this either '\0' (for C literate) or NUL (ASCII character
name), please.  The same for other commands with -z.

DG>  diff-tree
DG> -	diff-tree [-r] [-z] <tree sha1> <tree sha1>
DG> +	diff-tree [-r] [-z] <tree/commit sha1> <tree/commit sha1>
DG> +...
DG> +--cached
DG> +	Cached only (private?)

What?  The beauty of diff-tree is it does not care about
dircache at all.  Maybe this is a Pasky addition, but I wonder
what the semantics of this option are and why it is here...

DG>  ################################################################
DG>  read-tree
DG> -	read-tree [-m] <sha1>
DG> +	read-tree (<sha> | -m <sha1> [<sha2> <sha3>])"
DG> +
DG> +Reads the tree information given by <sha> into the directory cache,
DG> +but does not actually _update_ any of the files it "caches". (see:
DG> +checkout-cache)
DG> +
DG> +Optionally, it can merge a tree into the cache or perform a 3-way
DG> +merge.
DG> +
DG> +Trivial merges are done by read-tree itself.  Only conflicting paths
DG> +will be in unmerged state when read-tree returns.
DG> +...
DG> +NOTE NOTE NOTE! although read-tree could do some of these nontrivial
DG> +merges, only the "matches in all three states" thing collapses by
DG> +default.

The above "NOTE" is taken from the initial message from Linus
but it is no longer true.  These days, it merges when:

    - stage 2 and 3 are the same
    - stage 1 and stage 2 are the same and stage 3 is different
    - stage 1 and stage 3 are the same and stage 2 is different

Originally it merged only when all stages are the same.
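The three cases above can be written down as a small decision function. This is a sketch and not the read-tree source: the function and enum names are invented, every path is assumed to be present in all three stages, and the choice of which stage wins in each case is my reading of the rules (take whichever side changed).

```c
#include <assert.h>
#include <string.h>

enum { TAKE_STAGE2, TAKE_STAGE3, UNMERGED };

/* s1, s2, s3 are the 20-byte SHA1s of one path in stages 1, 2 and 3. */
static int
trivial_merge(const unsigned char *s1, const unsigned char *s2,
              const unsigned char *s3)
{
    assert(s1 && s2 && s3);
    if (memcmp(s2, s3, 20) == 0)
        return TAKE_STAGE2;     /* stages 2 and 3 agree */
    if (memcmp(s1, s2, 20) == 0)
        return TAKE_STAGE3;     /* only stage 3 differs from stage 1 */
    if (memcmp(s1, s3, 20) == 0)
        return TAKE_STAGE2;     /* only stage 2 differs from stage 1 */
    return UNMERGED;            /* all three differ: leave it unmerged */
}
```

The original behaviour, merging only when all stages are the same, is the special case where the first two tests would both succeed.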

Also you do not describe the single tree merge ("read-tree -m
sha1").  Its semantics is:

    Operate as if the user did not specify "-m", but if the
    original cache had an entry for the same pathname already
    and the contents of the original matches with the tree being
    read, use the stat info from the original instead.

This is used to avoid unnecessary false hits when show-diff is
run after read-tree.

DG>  ################################################################
DG> @@ -151,8 +603,145 @@
DG>  show-files
DG>  	show-files [-z] [-t] (--[cached|deleted|others|ignored|stage])*
 
Although I like it, I do not think -t is in core.  It is Pasky.
Also you missed "show-files --unmerged".

    


^ permalink raw reply

* Re: Performance of various compressors
From: Aaron Lehmann @ 2005-04-22 20:38 UTC (permalink / raw)
  To: Mike Taht; +Cc: git
In-Reply-To: <426734DE.3040606@timesys.com>

On Wed, Apr 20, 2005 at 10:06:38PM -0700, Mike Taht wrote:
> That doing the compression at a level of 3, rather than the max of 9, 
> cuts the cpu time required for a big git commit by over half, and that 
> that actually translates into a win on the I/O to disk. (these tests 
> were performed on a dual opteron 842)

If (de)compression is slowing things down, you might want to check out
lzo (http://www.oberhumer.com/opensource/lzo/). I tested it on the
2.6.11 kernel source and found that lzo -7 output is only 2% larger
than gzip -3, but lzo decompression is almost 3 times faster. The
downside is that lzo took 5 times longer to perform the compression at
-7. Compression with lzo -3 is 3.5 times faster than gzip -3, but it
produces a file that's 37% bigger. Unfortunately, lzo has no settings
in between -3 and -7. I'd expect git to be more sensitive to
decompression speeds, though.

BTW, lzo decompression speed is not affected by the compression level.

^ permalink raw reply

* Re: wit 0.0.3 - a web interface for git available
From: Christian Meder @ 2005-04-22 20:35 UTC (permalink / raw)
  To: Greg KH; +Cc: Kay Sievers, git
In-Reply-To: <20050421073326.GA21772@kroah.com>

On Thu, 2005-04-21 at 00:33 -0700, Greg KH wrote:
> On Thu, Apr 21, 2005 at 03:28:27AM +0200, Kay Sievers wrote:
> > On Wed, Apr 20, 2005 at 10:42:53AM +0100, Christoph Hellwig wrote:
> > > On Tue, Apr 19, 2005 at 09:18:29PM -0700, Greg KH wrote:
> > > > On Wed, Apr 20, 2005 at 02:29:11AM +0200, Christian Meder wrote:
> > > > > Hi,
> > > > > 
> > > > > ok it's starting to look like spam ;-)
> > > > > 
> > > > > I uploaded a new version of wit to http://www.absolutegiganten.org/wit
> > > > 
> > > > Why not work together with Kay's tool:
> > > > 	http://ehlo.org/~kay/gitweb.pl?project=linux-2.6&action=show_log
> > > 
> > > That one looks really nice.  One major feature I'd love to see would
> > > be a show all diffs link for a changeset.
> > 
> > It's working now:
> >   http://ehlo.org/~kay/gitweb.pl
> > 
> > Many thanks to Christian Gierke for all the interface work, the nice
> > layout and the git logo. Thanks for the colored diff to Ken Brush.
> 
> Very nice, this looks great.  And hey, we have a git logo now :)

BTW is this logo already officially blessed ?


			Christian
-- 
Christian Meder, email: chris@absolutegiganten.org

The Way-Seeking Mind of a tenzo is actualized 
by rolling up your sleeves.

                (Eihei Dogen Zenji)


^ permalink raw reply

* Re: [PATCH] multi item packed files
From: Chris Mason @ 2005-04-22 20:32 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Krzysztof Halasa, git
In-Reply-To: <Pine.LNX.4.58.0504221230020.2344@ppc970.osdl.org>

On Friday 22 April 2005 15:43, Linus Torvalds wrote:
> On Fri, 22 Apr 2005, Chris Mason wrote:
> > The problem I see for git is that once you have enough data, it should
> > degrade over and over again somewhat quickly.
>
> I really doubt that.
>
> There's a more or less constant amount of new data added all the time: the
> number of changes does _not_ grow with history. The number of changes
> grows with the amount of changes going on in the tree, and while that
> isn't exactly constant, it definitely is not something that grows very
> fast.

From a filesystem point of view, it's not the number of changes that matters, 
it's the distance between them.  The amount of new data is constant, but the 
speed of accessing the new data is affected by the bulk of old data on disk.

Even with defragging you hopefully end up with a big chunk of the disk where 
everything is in order.  Then you add a new file and it goes either somewhere 
behind that big chunk or in front of it.  The next new file might go 
somewhere behind or in front etc etc.  Having a big chunk just means the new 
files are likely to be farther apart making reads of the new data very seeky.

>
> Btw, this is how git is able to be so fast in the first place. Git is fast
> because it knows that the "size of the change" is a lot smaller than the
> "size of the repository", so it fundamentally at all points tries to make
> sure that it only ever bothers with stuff that has changed.
>
> Stuff that hasn't changed, it ignores very _very_ efficiently.
>
git as a write engine is very fast, and we definitely write more than we read.

> > I grabbed Ingo's tarball of 28,000 patches since 2.4.0 and applied them
> > all into git on ext3 (htree).  It only took ~2.5 hrs to apply.
>
> Ok, I'd actually wish it took even less, but that's still a pretty
> impressive average of three patches a second.

Yeah, and this was a relatively old machine with slowish drives.  One run to 
apply into my packed tree is finished and only took 2 hours.  But, I had 
'tuned' it to make bigger packed files, and the end result is 2MB compressed 
objects.    Great for compression rate, but my dumb format doesn't hold up 
well for reading it back.

If I pack every 64k (uncompressed), the checkout-tree time goes down to 3m14s.  
That's a very big difference considering how stupid my code is.  .git was only 
20% smaller with 64k chunks.  I should be able to do better...I'll do one 
more run.

>
> > Anyway, I ended up with a 2.6GB .git directory.  Then I:
> >
> > rm .git/index
> > umount ; mount again
> > time read-tree `tree-id` (24.45s)
> > time checkout-cache --prefix=../checkout/ -a -f (4m30s)
> >
> > --prefix is neat ;)
>
> That sounds pretty acceptable. Four minutes is a long time, but I assume
> that the whole point of the exercise was to try to test worst-case
> behaviour.  We can certainly make sure that real usage gets lower numbers
> than that (in particular, my "real usage" ends up being 100% in the disk
> cache ;)

I had a tree with 28,000 patches.  If we pretend that one bk changeset will 
equal one git changeset, we'd have 64,000 patches (57k without empty 
mergesets), and it probably wouldn't fit into ram anymore ;)  Our bk cset 
rate was about 24k/year, so we'll have to trim very aggressively to have 
reasonable performance.

For a working tree that's fine, but we need some fast central place to pull 
the working .git trees from, and we're really going to feel the random io 
there.

-chris

^ permalink raw reply

* Re: Mozilla SHA1 implementation
From: Daniel Barkalow @ 2005-04-22 20:29 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Paul Mackerras, Edgar Toernig, Git Mailing List
In-Reply-To: <Pine.LNX.4.58.0504220824480.2344@ppc970.osdl.org>

On Fri, 22 Apr 2005, Linus Torvalds wrote:

> But it's more likely the precompiled libssl. I'm not compiling the openssl
> thing myself, but just using the standard 0.9.7a version that comes with
> YDL. Which, btw, causes all of 
> 
> 	/lib/libcrypto.so.4

This is the one that actually has the SHA1 stuff, not libssl at all. You
can skip at least some of this by just using -lcrypto.

> 	/usr/lib/libgssapi_krb5.so.2
> 	/usr/lib/libkrb5.so.3
> 	/lib/libcom_err.so.2
> 	/usr/lib/libk5crypto.so.3
> 	/lib/libresolv.so.2
> 	/lib/libdl.so.2

	-Daniel
*This .sig left intentionally blank*


^ permalink raw reply

* Re: "GIT_INDEX_FILE" environment variable
From: Junio C Hamano @ 2005-04-22 20:20 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List
In-Reply-To: <Pine.LNX.4.58.0504221147050.2344@ppc970.osdl.org>

>>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes:

LT> ... The fact is, you can do _exactly_ what you are talking
LT> about by just wrapping the calls in

LT> 	( cd $WORKING_DIR && git-cmd )

LT> which simply doesn't have any downsides that I can see.

Almost, with a counter-example.  Please try this yourself:

  $ cd mozilla-sha1
  $ echo '/* garbage */' >>sha1.c
  $ sh -c 'cd .. && show-diff "$0" "$@"' sha1.c
  $ cd .. && show-diff mozilla-sha1/sha1.c

Some commands that take working tree relative paths do strange
things without the path munging I discussed in the original
message ("$R- prefixing") if you chdir to the $WORKING_DIR.  The
jit-update-cache wrapper I sent in the previous message is an
example of how the Cogito layer can work around it.  It does not
break my "yuck" meter but I think it probably makes most people
barf ;-).  I was trying to make this path munging part easier
for the upper layer by making the core aware of WORKING_DIR.

Here is an updated set of commands that needs such path munging:

  check-files paths...
  show-diff [-R] [-q] [-s] [-z] [paths...]
  update-cache [--add] [--remove] [--refresh]
      [--cacheinfo mode blob-id] paths...
  checkout-cache [-f] [-a] paths...

That said, I do not think the above set is too many to warrant a
core surgery (I am agreeing with your conclusion here).  Unless
we also normalize path to support something like:

  $ cd mozilla-sha1
  $ echo '/* garbage */' >>cache.h
  $ sh -c 'cd .. && show-diff "$0" "$@"' ../cache.h

in the core, that is.
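
To make the path munging concrete, here is a minimal sketch of what
the upper layer (or the core) would have to do, including the "../"
case.  The function name and the idea of passing the current
subdirectory as a prefix are assumptions for the example, not
something any of the existing tools does:

```python
import posixpath


def munge_paths(prefix, paths):
    """Turn paths given relative to a subdirectory into tree-relative ones.

    prefix is the user's current directory relative to the top of the
    working tree (e.g. "mozilla-sha1"); paths are what the user typed.
    """
    result = []
    for path in paths:
        full = posixpath.normpath(posixpath.join(prefix, path))
        # Refuse anything that ends up outside the working tree.
        if posixpath.isabs(full) or full == ".." or full.startswith("../"):
            raise ValueError("path outside working tree: " + path)
        result.append(full)
    return result
```

With prefix "mozilla-sha1", "sha1.c" becomes "mozilla-sha1/sha1.c" and
"../cache.h" becomes "cache.h", which is what the commands listed
above would then be given after the chdir.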


^ permalink raw reply

* [PATCH] More docs
From: David Greaves @ 2005-04-22 19:50 UTC (permalink / raw)
  To: Petr Baudis; +Cc: Git Mailing List

[-- Attachment #1: Type: text/plain, Size: 484 bytes --]

Removed Cogito stuff

Updated:
  checkout-cache
  commit-tree
  diff-cache
  diff-tree
  fsck-cache
  git-export
  init-db
  read-tree
  show-files
  update-cache
  write-tree

Thanks also to Junio C Hamano <junio@siamese.dyndns.org>
More eyes please :)

There are ???s where I'd _especially_ appreciate comments

Also note that much of this is simply edited emails from the list or 
comments cross checked with the source.


Signed-off-by: David Greaves <david@dgreaves.com>
---





[-- Attachment #2: README.reference.patch3 --]
[-- Type: text/plain, Size: 22687 bytes --]

Index: README.reference
===================================================================
--- 5f61aecb06c2f2579bbb5951b1b53e0dedc434eb/README.reference  (mode:100644 sha1:8186a561108d3c62625614272bd5e2f7d5826b4b)
+++ bc80a76ad52f0cdf28c4e10a2490da3558f5def6/README.reference  (mode:100644 sha1:810f4990448fbee57490a945d03a5e8a3eaadec0)
@@ -53,7 +53,7 @@
 
 ################################################################
 checkout-cache
-	checkout-cache [-q] [-a] [-f] [--] <file>...
+	checkout-cache [-q] [-a] [-f] [--prefix=<string>] [--] <file>...
 
 Will copy all files listed from the cache to the working directory
 (not overwriting existing files). Note that the file contents are
@@ -69,6 +69,13 @@
 	checks out all files in the cache before processing listed
 	files.
 
+--prefix=<string>
+	When creating files, prepend this <string> (usually a
+	directory including a trailing /)
+
+--
+	Do not interpret any more arguments as options.
+
 Note that the order of the flags matters:
 
 	checkout-cache -a -f file.c
@@ -96,6 +103,22 @@
 problems (not possible in the above example, but get used to it in
 scripting!).
 
+The prefix ability basically makes it trivial to use checkout-cache as
+a "export as tree" function. Just read the desired tree into the
+index, and do a
+  
+        checkout-cache --prefix=export-dir/ -a
+  
+and checkout-cache will "export" the cache into the specified
+directory.
+  
+NOTE! The final "/" is important. The exported name is literally just
+prefixed with the specified string, so you can also do something like
+  
+        checkout-cache --prefix=.merged- Makefile
+  
+to check out the currently cached copy of "Makefile" into the file
+".merged-Makefile".
 
 ################################################################
 commit-id
@@ -109,12 +132,340 @@
 
 ################################################################
 commit-tree
-	commit-tree <sha1> [-p <sha1>]* < changelog
+	commit-tree <sha1> [-p <parent sha1>]* < changelog
+
+Creates a new commit object based on the provided tree object and
+emits the new commit object id on stdout. If no parent is given then
+it is considered to be an initial tree.
+
+A commit object usually has 1 parent (a commit after a change) or up
+to 16 parents.  More than one parent represents a merge of branches that
+led to them.
+
+While a tree represents a particular directory state of a working
+directory, a commit represents that state in "time", and explains how
+to get there.
+
+Normally a commit would identify a new "HEAD" state, and while git
+doesn't care where you save the note about that state, in practice we
+tend to just write the result to the file ".git/HEAD", so that we can
+always see what the last committed state was.
+
+Options
+
+<sha1>
+	An existing tree object
+
+-p <parent sha1>
+	Each -p indicates the id of a parent commit object.
+	
+
+Commit Information
+
+A commit encapsulates:
+	all parent object ids
+	author name, email and date
+	committer name and email and the commit time.
+
+If not provided, commit-tree uses your name, hostname and domain to
+provide author and committer info. This can be overridden using the
+following environment variables.
+	AUTHOR_NAME
+	AUTHOR_EMAIL
+	AUTHOR_DATE
+	COMMIT_AUTHOR_NAME
+	COMMIT_AUTHOR_EMAIL
+(nb <,> and '\n's are stripped)
+
+A commit comment is read from stdin (max 999 chars)
+
+see also: write-tree
 
 
 ################################################################
+diff-cache
+	diff-cache [-r] [-z] [--cached] <tree/commit sha1>
+
+Compares the content and mode of the blobs found via a tree object
+with the content of the current cache, optionally ignoring the
+stat state of the file on disk.
+
+(This is basically a special case of diff-tree that works with the
+current cache as the first tree.)
+
+<tree sha1>
+	The id of a tree or commit object to diff against.
+
+-r
+	recurse
+
+-z
+	\0 line termination on output
+
+--cached
+	do not consider the on-disk file at all
+
+Output format:
+
+For files in the tree but not in the cache
+-<mode>\t <type>\t	<sha1>\t	<path><filename>
+
+For files in the cache but not in the tree
++<mode>\t <type>\t	<sha1>\t	<path><filename>
+
+For files that differ:
+*<tree-mode>-><cache-mode>\t <type>\t	<tree sha1>-><cache sha1>\t	<path><filename>
+
+In the special case of the file being changed on disk and out of sync
+with the cache, the sha1 is all 0's.  Example:
+
+	*100644->100660 blob    5be4a414b32cf4204f889469942986d3d783da84->0000000000000000000000000000000000000000      file.c
+	
+
+Operating Modes
+You can choose whether you want to trust the index file entirely
+(using the "--cached" flag) or ask the diff logic to show any files
+that don't match the stat state as being "tentatively changed".  Both
+of these operations are very useful indeed.
+
+Cached Mode
+If --cached is specified, it allows you to ask:
+	show me the differences between HEAD and the current index
+	contents (the ones I'd write with a "write-tree")
+
+For example, let's say that you have worked on your index file, and are
+ready to commit. You want to see exactly _what_ you are going to commit
+without having to write a new tree object and compare it that way, and to
+do that, you just do
+
+	diff-cache --cached $(cat .git/HEAD)
+
+Example: let's say I had renamed "commit.c" to "git-commit.c", and I had 
+done an "update-cache" to make that effective in the index file. 
+"show-diff" wouldn't show anything at all, since the index file matches 
+my working directory. But doing a diff-cache does:
+	torvalds@ppc970:~/git> diff-cache --cached $(cat .git/HEAD)
+	-100644 blob    4161aecc6700a2eb579e842af0b7f22b98443f74        commit.c
+	+100644 blob    4161aecc6700a2eb579e842af0b7f22b98443f74        git-commit.c
+
+And as you can see, the output matches "diff-tree -r" output (we
+always do "-r", since the index is always fully populated
+??CHECK??).
+You can trivially see that the above is a rename.
+
+In fact, "diff-cache --cached" _should_ always be entirely equivalent to
+actually doing a "write-tree" and comparing that. Except this one is much
+nicer for the case where you just want to check where you are.
+
+So doing a "diff-cache --cached" is basically very useful when you are 
+asking yourself "what have I already marked for being committed, and 
+what's the difference to a previous tree".
+
+Non-cached Mode
+
+The "non-cached" mode takes a different approach, and is potentially
+the even more useful of the two in that what it does can't be emulated
+with a "write-tree + diff-tree". Thus that's the default mode.  The
+non-cached version asks the question
+
+   "show me the differences between HEAD and the currently checked out 
+    tree - index contents _and_ files that aren't up-to-date"
+
+which is obviously a very useful question too, since that tells you what
+you _could_ commit. Again, the output matches the "diff-tree -r" output to
+a tee, but with a twist.
+
+The twist is that if some file doesn't match the cache, we don't have a
+backing store thing for it, and we use the magic "all-zero" sha1 to show
+that. So let's say that you have edited "kernel/sched.c", but have not
+actually done an update-cache on it yet - there is no "object" associated
+with the new state, and you get:
+
+	torvalds@ppc970:~/v2.6/linux> diff-cache $(cat .git/HEAD )
+	*100644->100664 blob    7476bbcfe5ef5a1dd87d745f298b831143e4d77e->0000000000000000000000000000000000000000      kernel/sched.c
+
+ie it shows that the tree has changed, and that "kernel/sched.c" is
+not up-to-date and may contain new stuff. The all-zero sha1 means that to
+get the real diff, you need to look at the object in the working directory
+directly rather than do an object-to-object diff.
+
+NOTE! As with other commands of this type, "diff-cache" does not actually 
+look at the contents of the file at all. So maybe "kernel/sched.c" hasn't 
+actually changed, and it's just that you touched it. In either case, it's 
+a note that you need to update-cache it to make the cache be in sync.
+
+NOTE 2! You can have a mixture of files show up as "has been updated" and
+"is still dirty in the working directory" together. You can always tell
+which file is in which state, since the "has been updated" ones show a
+valid sha1, and the "not in sync with the index" ones will always have the
+special all-zero sha1.
+
+################################################################
 diff-tree
-	diff-tree [-r] [-z] <tree sha1> <tree sha1>
+	diff-tree [-r] [-z] <tree/commit sha1> <tree/commit sha1>
+
+Compares the content and mode of the blobs found via two tree objects.
+
+Note that diff-tree can use the tree encapsulated in a commit object.
+
+<tree sha1>
+	The id of a tree or commit object.
+
+-r
+	recurse
+
+-z
+	\0 line termination on output
+
+--cached
+	Cached only (private?)
+
+Output format:
+
+For files in tree1 but not in tree2
+-<mode>\t <type>\t	<sha1>\t	<path><filename>
+
+For files not in tree1 but in tree2
++<mode>\t <type>\t	<sha1>\t	<path><filename>
+
+For files that differ:
+*<tree1-mode>-><tree2-mode>\t <type>\t	<tree1 sha1>-><tree2 sha1>\t	<path><filename>
+
+
+An example of normal usage is:
+
+	torvalds@ppc970:~/git> diff-tree 5319e4d609cdd282069cc4dce33c1db559539b03 b4e628ea30d5ab3606119d2ea5caeab141d38df7
+	*100664->100664 blob    ac348b7d5278e9d04e3a1cd417389379c32b014f->a01513ed4d4d565911a60981bfb4173311ba3688      fsck-cache.c
+
+which tells you that the last commit changed just one file (it's from
+this one:
+
+	commit 3c6f7ca19ad4043e9e72fa94106f352897e651a8
+	tree 5319e4d609cdd282069cc4dce33c1db559539b03
+	parent b4e628ea30d5ab3606119d2ea5caeab141d38df7
+	author Linus Torvalds <torvalds@ppc970.osdl.org> Sat Apr 9 12:02:30 2005
+	committer Linus Torvalds <torvalds@ppc970.osdl.org> Sat Apr 9 12:02:30 2005
+
+	Make "fsck-cache" print out all the root commits it finds.
+
+	Once I do the reference tracking, I'll also make it print out all the
+	HEAD commits it finds, which is even more interesting.
+
+in case you care).
+
+
+################################################################
+fsck-cache
+	fsck-cache [[--unreachable] <head-sha1>*]
+
+Verifies the connectivity and validity of the objects in the database.
+
+<head-sha1>
+	A commit object to begin an unreachability trace
+
+--unreachable
+	print out objects that exist but that aren't reachable from any
+	of the specified root nodes
+
+It tests SHA1 and general object sanity, and it does full tracking of
+the resulting reachability and everything else. It prints out any
+corruption it finds (missing or bad objects), and if you use the
+"--unreachable" flag it will also print out objects that exist but
+that aren't reachable from any of the specified root nodes.
+
+So for example
+
+	fsck-cache --unreachable $(cat .git/HEAD)
+
+or, for Cogito users:
+
+	fsck-cache --unreachable $(cat .git/heads/*)
+
+will do quite a _lot_ of verification on the tree. There are a few
+extra validity tests to be added (make sure that tree objects are
+sorted properly etc), but on the whole if "fsck-cache" is happy, you
+do have a valid tree.
+
+Any corrupt objects you will have to find in backups or other archives
+(ie you can just remove them and do an "rsync" with some other site in
+the hopes that somebody else has the object you have corrupted).
+
+Of course, "valid tree" doesn't mean that it wasn't generated by some
+evil person, and the end result might be crap. Git is a revision
+tracking system, not a quality assurance system ;)
+
+Extracted Diagnostics
+expect dangling commits - potential heads - due to lack of head information
+	You haven't specified any nodes as heads so it won't be
+	possible to differentiate between un-parented commits and
+	root nodes.
+
+missing sha1 directory '<dir>'
+	The directory holding the sha1 objects is missing.
+
+unreachable <type> <sha1>
+	The <type> object <sha1> isn't actually referred to directly
+	or indirectly in any of the trees or commits seen. This can
+	mean that there's another root node that you're not specifying
+	or that the tree is corrupt. If you haven't missed a root node
+	then you might as well delete unreachable nodes since they
+	can't be used.
+
+missing <type> <sha1>
+	The <type> object <sha1>, is referred to but isn't present in
+	the database.
+
+dangling <type> <sha1>
+	The <type> object <sha1>, is present in the database but never
+	_directly_ used. A dangling commit could be a root node.
+
+warning: fsck-cache: tree <tree> has full pathnames in it
+	And it shouldn't...
+
+sha1 mismatch <sha1>
+	The database has an object whose sha1 doesn't match the
+	database value.
+	This indicates a ??serious?? data integrity problem.
+	(note: this error occurred during early git development when
+	the database format changed.)
+
+Environment Variables
+
+SHA1_FILE_DIRECTORY
+	used to specify the object database root (usually .git/objects)
+
+################################################################
+git-export
+	git-export top [base]
+
+probably deprecated:
+On Wed, 20 Apr 2005, Petr Baudis wrote:
+>> I will probably not buy git-export, though. (That is, it is merged, but
+>> I won't make git frontend for it.) My "git export" already does
+>> something different, but more importantly, "git patch" of mine already
+>> does effectively the same thing as you do, just for a single patch; so I
+>> will probably just extend it to do it for an (a,b] range of patches.
+
+
+That's fine. It was a quick hack, just to show that if somebody wants to, 
+the data is trivially exportable.
+
+		Linus
+
+
+################################################################
+init-db
+	init-db
+
+This simply creates an empty git object database - basically a .git
+directory.
+
+If the object storage directory is specified via the
+SHA1_FILE_DIRECTORY environment variable then the sha1 directories are
+created underneath - otherwise the default .git/objects directory is
+used.
+
+init-db won't hurt an existing repository.
 
 
 ################################################################
@@ -134,7 +485,108 @@
 
 ################################################################
 read-tree
-	read-tree [-m] <sha1>
+	read-tree (<sha> | -m <sha1> [<sha2> <sha3>])
+
+Reads the tree information given by <sha> into the directory cache,
+but does not actually _update_ any of the files it "caches". (see:
+checkout-cache)
+
+Optionally, it can merge a tree into the cache or perform a 3-way
+merge.
+
+Trivial merges are done by read-tree itself.  Only conflicting paths
+will be in unmerged state when read-tree returns.
+
+
+-m
+	Perform a merge, not just a read
+
+<sha#>
+	The id of the tree object(s) to be read/merged.
+
+
+Merging
+Each "index" entry has two bits worth of "stage" state. stage 0 is the
+normal one, and is the only one you'd see in any kind of normal use.
+
+However, when you do "read-tree" with multiple trees, the "stage"
+starts out at 0, but increments for each tree you read. And in
+particular, the old "-m" flag (which used to be "merge with old
+state") has a new meaning: it now means "start at stage 1" instead.
+
+This means that you can do
+
+	read-tree -m <tree1> <tree2> <tree3>
+
+and you will end up with an index with all of the <tree1> entries in
+"stage1", all of the <tree2> entries in "stage2" and all of the
+<tree3> entries in "stage3".
+
+Furthermore, "read-tree" has this special-case logic that says: if you
+see a file that matches in all respects in all three states, it
+"collapses" back to "stage0".
+
+Write-tree refuses to write a nonsensical tree, so write-tree will
+complain about unmerged entries if it sees a single entry that is not
+stage 0.
+
+Ok, this all sounds like a collection of totally nonsensical rules,
+but it's actually exactly what you want in order to do a fast
+merge. The different stages represent the "result tree" (stage 0, aka
+"merged"), the original tree (stage 1, aka "orig"), and the two trees
+you are trying to merge (stage 2 and 3 respectively).
+
+In fact, the way "read-tree" works, it's entirely agnostic about how
+you assign the stages, and you could really assign them any which way,
+and the above is just a suggested way to do it (except since
+"write-tree" refuses to write anything but stage0 entries, it makes
+sense to always consider stage 0 to be the "full merge" state).
+
+So what happens? Try it out. Select the original tree, and two trees
+to merge, and look how it works:
+
+ - if a file exists in identical format in all three trees, it will 
+   automatically collapse to "merged" state by the new read-tree.
+
+ - a file that has _any_ difference whatsoever in the three trees
+   will stay as separate entries in the index. It's up to "script
+   policy" to determine how to remove the non-0 stages, and insert a
+   merged version.  But since the index is always sorted, they're easy
+   to find: they'll be clustered together.
+
+ - the index file saves and restores with all this information, so you
+   can merge things incrementally, but as long as it has entries in
+   stages 1/2/3 (ie "unmerged entries") you can't write the result.
+
+So now the merge algorithm ends up being really simple:
+
+ - you walk the index in order, and ignore all entries of stage 0,
+   since they've already been done.
+
+ - if you find a "stage1", but no matching "stage2" or "stage3", you
+   know it's been removed from both trees (it only existed in the
+   original tree), and you remove that entry.
+ - if you find a matching "stage2" and "stage3" tree, you remove one
+   of them, and turn the other into a "stage0" entry. Remove any
+   matching "stage1" entry if it exists too.  .. all the normal trivial rules ..
+
+NOTE NOTE NOTE! although read-tree could do some of these nontrivial
+merges, only the "matches in all three states" thing collapses by
+default. This is because even though there are other trivial cases
+("matches in both merge trees but not in the original one"), those
+cases might actually be interesting for the merge logic to know about,
+so that information is left around. It should be fairly rare anyway,
+so a few extra index entries are written out to disk so that the merge
+can be annotated.
+
+Incidentally - it also means that you don't even have to have a separate 
+subdirectory for this. All the information literally is in the index file, 
+which is a temporary thing anyway. There is no need to worry about what is in 
+the working directory, since it is never shown and never used.
+
+see also:
+write-tree
+show-files
 
 
 ################################################################
@@ -151,8 +603,145 @@
 show-files
 	show-files [-z] [-t] (--[cached|deleted|others|ignored|stage])*
 
+This merges the file listing in the directory cache index with the
+actual working directory list, and shows different combinations of the
+two.  
+
+--cached
+	Show cached files in the output (default)
+
+--deleted
+	Show deleted files in the output
+
+--others
+	Show other files in the output
+
+--ignored
+	Show ignored files in the output
+
+--stage
+	Show stage files in the output
+
+-t
+	Show the following tags (followed by a space) at the start of
+	each line:
+	H	cached
+	M	unmerged
+	R	removed/deleted
+	?	other
+
+-z
+	\0 line termination on output
+
+Output
+show-files just outputs the filename unless --stage is specified in
+which case it outputs:
+
+[<tag> ]<mode> <sha1> <stage> <filename>
+
+"show-files --unmerged" and "show-files --stage" can be used to examine
+detailed information on unmerged paths.
+
+For an unmerged path, instead of recording a single mode/SHA1 pair,
+the dircache records up to three such pairs; one from tree O in stage
+1, A in stage 2, and B in stage 3.  This information can be used by
+the user (or Cogito) to see what should eventually be recorded at the
+path. (see read-tree for more information on state)
+
+see also:
+read-tree
+
 
 ################################################################
 unpack-file
 	unpack-file.c <sha1>
 
+################################################################
+update-cache
+	update-cache [--add] [--remove] [--refresh] [--cacheinfo <mode> <sha1> <path>]* [--] [<file>]*
+
+Modifies the index or directory cache. Each file mentioned is updated
+into the cache and any 'unmerged' or 'needs updating' state is
+cleared.
+
+The way update-cache handles files it is told about can be modified
+using the various options:
+
+--add
+	If a specified file isn't in the cache already then it's
+	added.
+	Default behaviour is to ignore new files.
+
+--remove
+	If a specified file is in the cache but is missing then it's
+	removed.
+	Default behaviour is to ignore removed files.
+
+--refresh
+	Looks at the current cache and checks to see if merges or
+	updates are needed by checking stat() information.
+
+--cacheinfo <mode> <sha1> <path>
+	Directly insert the specified info into the cache.
+	
+--
+	Do not interpret any more arguments as options.
+
+<file>
+	Files to act on.
+	Note that files beginning with '.' are discarded. This includes
+	"./file" and "dir/./file". If you don't want this, then use
+	cleaner names.
+	The same applies to directories ending '/' and paths with '//'.
+
+
+Using --refresh
+
+"--refresh" does not calculate a new sha1 file or bring the cache
+up-to-date for mode/content changes. But what it _does_ do is to
+"re-match" the stat information of a file with the cache, so that you
+can refresh the cache for a file that hasn't been changed but where
+the stat entry is out of date.
+
+For example, you'd want to do this after doing a "read-tree", to link
+up the stat cache details with the proper files.
+
+Using --cacheinfo
+--cacheinfo is used to register a file that is not in the current
+working directory.  This is useful for minimum-checkout merging.
+
+To pretend you have a file with mode and sha1 at path, say:
+
+ $ update-cache --cacheinfo mode sha1 path
+
+
+################################################################
+write-tree
+	write-tree
+
+Creates a tree object using the current cache.
+
+The cache must be merged.
+
+Conceptually, write-tree sync()s the current directory cache contents
+into a set of tree files.
+In order to have that match what is actually in your directory right
+now, you need to have done an "update-cache" phase before you did the
+"write-tree".
+
+
+################################################################
+
+
+
+git Environment Variables
+AUTHOR_NAME
+AUTHOR_EMAIL
+AUTHOR_DATE
+COMMIT_AUTHOR_NAME
+COMMIT_AUTHOR_EMAIL
+GIT_DIFF_CMD
+GIT_DIFF_OPTS
+GIT_INDEX_FILE
+SHA1_FILE_DIRECTORY
+
Index: contrib/gitfeedmaillist.sh
===================================================================

^ permalink raw reply

* Re: [3/5] Add http-pull
From: Daniel Barkalow @ 2005-04-22 19:46 UTC (permalink / raw)
  To: tony.luck; +Cc: Brad Roberts, Petr Baudis, git
In-Reply-To: <200504212205.j3LM5J005103@unix-os.sc.intel.com>

On Thu, 21 Apr 2005 tony.luck@intel.com wrote:

> On Wed, 20 Apr 2005, Brad Roberts wrote:
> > How about fetching in the inverse order.  Ie, deepest parents up towards
> > current.  With that method the repository is always self consistent, even
> > if not yet current.
> 
> Daniel Barkalow replied:
> > You don't know the deepest parents to fetch until you've read everything
> > more recent, since the history you'd have to walk is the history you're
> > downloading.
> 
> You "just" need to defer adding tree/commit objects to the repository until
> after you have inserted all objects on which they depend.  That's what my
> "wget" based version does ... it's very crude, in that it loads all tree
> & commit objects into a temporary repository (.gittmp) ... since you can
> only use "cat-file" and "ls-tree" on things if they live in objects/xx/xxx..xxx
> The blobs can go directly into the real repo (but to be really safe you'd
> have to ensure that the whole blob had been pulled from the network before
> inserting it ... it's probably a good move to validate everything that you
> pull from the outside world too).

The problem with this general scheme is that it means that you have to
start over if something goes wrong, rather than resuming from where you
left off (and being able to use what you got until then). I think a better
solution is to track what things you mean to have and what things you
expect you could get from where.
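
A minimal sketch of that bookkeeping, assuming the list of wanted object IDs is kept one per line in a plain file and that objects sit under the objects/xx/xxx..xxx layout quoted above. The function name and the list file are hypothetical, not part of any existing tool:

```shell
#!/bin/sh
# Hypothetical helper: print every wanted object ID that is not yet in the
# object directory. IDs are read one per line from the file "$1"; objects
# are assumed to live at "$2"/xx/xxx..xxx.
missing_objects() {
	wanted_file=$1
	object_dir=$2
	while IFS= read -r id; do
		[ -n "$id" ] || continue          # skip blank lines
		rest=${id#??}                     # everything after the first two characters
		prefix=${id%"$rest"}              # the first two characters
		[ -e "$object_dir/$prefix/$rest" ] || printf '%s\n' "$id"
	done < "$wanted_file"
}
```

After an interrupted pull, running it again yields only the IDs still to be fetched, so the transfer resumes instead of starting over.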

As for validation, I now have my programs (which I haven't gotten a chance
to send out recently) checking everything as it is downloaded to make sure
it is complete (zlib likes it) and has the correct hash.
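
As a rough illustration of validating before inserting (not the programs mentioned above), the sketch below moves a downloaded file into place only when its SHA-1 matches the name it was requested under. It assumes the `sha1sum` utility from GNU coreutils is available, and it hashes the bytes as they are on disk, leaving out the zlib check. The function name is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: move a downloaded file into place only if its SHA-1
# matches the name we expected. Simplification: this hashes the file bytes
# as they are on disk, with no zlib step.
store_if_valid() {
	expected=$1
	tmpfile=$2
	dest_dir=$3
	actual=$(sha1sum < "$tmpfile" | cut -d' ' -f1) || return 1
	if [ "$actual" != "$expected" ]; then
		echo "hash mismatch for $expected, discarding" >&2
		rm -f "$tmpfile"
		return 1
	fi
	mkdir -p "$dest_dir" && mv "$tmpfile" "$dest_dir/$expected"
}
```

A truncated or corrupted download is discarded, and the function returns non-zero so the caller can retry.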

	-Daniel
*This .sig left intentionally blank*


^ permalink raw reply

* Re: [PATCH] multi item packed files
From: Linus Torvalds @ 2005-04-22 19:43 UTC (permalink / raw)
  To: Chris Mason; +Cc: Krzysztof Halasa, git
In-Reply-To: <200504221458.36300.mason@suse.com>



On Fri, 22 Apr 2005, Chris Mason wrote:
> 
> The problem I see for git is that once you have enough data, it should degrade 
> over and over again somewhat quickly.

I really doubt that.

There's a more or less constant amount of new data added all the time: the 
number of changes does _not_ grow with history. The number of changes 
grows with the amount of changes going on in the tree, and while that 
isn't exactly constant, it definitely is not something that grows very 
fast. 

Btw, this is how git is able to be so fast in the first place. Git is fast 
because it knows that the "size of the change" is a lot smaller than the 
"size of the repository", so it fundamentally at all points tries to make 
sure that it only ever bothers with stuff that has changed.

Stuff that hasn't changed, it ignores very _very_ efficiently. 

That's really the whole point of the index file: it's a way to quickly
ignore the stuff that hasn't changed - both for simple operations like
"show-diff", but also for complex operations like "merge these three
trees".
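
The "ignore what hasn't changed" idea can be shown with a much cruder stand-in for the index: one stamp file whose modification time is compared against every file in a tree. This is only an analogy, assuming a POSIX `find`. The real index keeps stat information per file, and the function name here is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: list regular files under "$1" modified more recently
# than the stamp file "$2". Everything else is skipped without being read.
changed_since() {
	find "$1" -type f -newer "$2" -print
}
```

The work done after the scan is proportional to the number of files it prints, not to the size of the tree.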

And it works exactly because the number of changes does _not_ grow at all 
linearly with the history of the project. In fact, in most projects, the 
rate of change goes _down_ when the project grows, because the project 
matures and generally gets more complicated and thus harder to change.

(The kernel _really_ is pretty special. I am willing to bet that there are
not a lot of big projects that have been able to continue to take changes
at the kind of pace that the kernel does. But we've had to work at it a
lot, including obviously using SCM tools that are very much geared towards
scaling. Why do you think the kernel puts more pressure on SCM's than
other projects? It's exactly because we're trying to scale our change
acceptance to bigger numbers).

So saying "once you have enough data, it will degrade quickly" ignores the 
fact that the growth in the rate of change (the "second derivative of the 
size of the project in time") really isn't that high. 

> I grabbed Ingo's tarball of 28,000 patches since 2.4.0 and applied them all 
> into git on ext3 (htree).  It only took ~2.5 hrs to apply.

Ok, I'd actually wish it took even less, but that's still a pretty
impressive average of three patches a second.

> Anyway, I ended up with a 2.6GB .git directory.  Then I:
> 
> rm .git/index
> umount ; mount again
> time read-tree `tree-id` (24.45s)
> time checkout-cache --prefix=../checkout/ -a -f (4m30s)
> 
> --prefix is neat ;)

That sounds pretty acceptable. Four minutes is a long time, but I assume
that the whole point of the exercise was to try to test worst-case
behaviour.  We can certainly make sure that real usage gets lower numbers
than that (in particular, my "real usage" ends up being 100% in the disk
cache ;)

			Linus

^ permalink raw reply

* Re: "GIT_INDEX_FILE" environment variable
From: Linus Torvalds @ 2005-04-22 19:24 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git Mailing List
In-Reply-To: <7vzmvr72j6.fsf@assigned-by-dhcp.cox.net>



On Thu, 21 Apr 2005, Junio C Hamano wrote:
> 
> The commands I would want to take paths relative to the user cwd
> are quite limited; note that I just want these available to the
> user and I do not care which one, the core or Cogito, groks the
> cwd relative paths:

I've thought about this, and looked at the sources, and it wouldn't be 
horrible.

HOWEVER, the more I thought about it, the less sense it made. The fact is, 
you can do _exactly_ what you are talking about by just wrapping the calls 
in

	( cd $WORKING_DIR && git-cmd )

which simply doesn't have any downsides that I can see. It always does the 
right thing, and it means that the tools will never have to care about 
what the base is. Keeping the core tools simple is important, because if they 
mess up, you're in serious trouble. In contrast, if higher levels mess up, 
you're not likely to have caused anything irrevocable.
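
A higher-level script could package that wrapping as a small function. The sketch below is one possible shape, with a hypothetical name; it relies only on a subshell and `cd`:

```shell
#!/bin/sh
# Hypothetical wrapper: run a command with "$1" as its working directory.
# The parentheses start a subshell, so the caller's own directory is untouched.
in_dir() {
	(
		cd "$1" || exit 1
		shift
		"$@"
	)
}
```

For example, `in_dir "$WORKING_DIR" show-diff` runs show-diff inside the working directory and returns its exit status, while the calling shell stays where it was.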

In fact, I probably shouldn't even have done the "--prefix=" stuff for
check-out, since the common "check out in a new directory" case (not the
"prefix file" case) can be pretty easily emulated with a fairly trivial 
script, something like

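	#!/bin/sh
	# Default to the index and object store of the directory we start in,
	# then export both so that checkout-cache actually sees them.
	CURRENT_DIR=$(pwd)
	GIT_INDEX_FILE=${GIT_INDEX_FILE:-$CURRENT_DIR/.git/index}
	SHA1_FILE_DIRECTORY=${SHA1_FILE_DIRECTORY:-$CURRENT_DIR/.git/objects}
	export GIT_INDEX_FILE SHA1_FILE_DIRECTORY
	TARGET=$1
	shift 1
	# Create the target directory and check the files out inside it.
	mkdir "$TARGET" && cd "$TARGET" && checkout-cache "$@"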

but since it was (a) very easy to add to that particular program, and (b) 
exporting a whole directory is pretty fundamental, I'll just leave that 
strange special case around.

So to the core tools, there really _are_ just two special things: the 
index file, and the place where to find the sha1 objects.  The working 
directory is really nothing but "pwd", which can be trivially changed 
before invocation, ie the addition of a new environment variable really 
doesn't _buy_ anything except for complexity.

		Linus

^ permalink raw reply

