Git development
 help / color / mirror / Atom feed
* Re: Change "pull" to _only_ download, and "git update"=pull+merge?
From: Petr Baudis @ 2005-04-20 20:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Martin Schlemmer, David Greaves, dwheeler, Daniel Barkalow, git
In-Reply-To: <20050420070157.GA12584@elte.hu>

Dear diary, on Wed, Apr 20, 2005 at 09:01:57AM CEST, I got a letter
where Ingo Molnar <mingo@elte.hu> told me that...
>  [...]
>  fatal: unable to execute 'gitmerge-file.sh'
>  fatal: merge program failed

Pure stupidity of mine, I forgot to add gitmerge-file.sh to the list of
scripts which get installed.

> another thing: it's confusing that during 'git pull', the rsync output 
> is not visible. Especially during large rsyncs, it would be nice to see 
> some progress. So i usually use a raw rsync not 'git pull', due to this.

Fixed. For further reference, you can also set RSYNC_FLAGS and put
whatever pleases you there.

> yet another thing: what is the canonical 'pasky way' of simply nuking 
> the current files and checking out the latest tree (according to 
> .git/HEAD). Right now i'm using a script to:
> 
>   read-tree $(tree-id $(cat .git/HEAD))
>   checkout-cache -a
> 
> (i first do an 'rm -f *' in the working directory)
> 
> i guess there's an existing command for this already?

git cancel

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor

^ permalink raw reply

* Re: git-viz tool for visualising commit trees
From: Olivier Andrieu @ 2005-04-20 20:00 UTC (permalink / raw)
  To: mingo; +Cc: pasky, git
In-Reply-To: <20050420100824.GB25477@elte.hu>

 > Ingo Molnar <mingo@elte.hu> [Wed, 20 Apr 2005]:
 > 
 > * Petr Baudis <pasky@ucw.cz> wrote:
 > 
 > >   Hi,
 > > 
 > >   just FYI, Olivier Andrieu was kind enough to port his monotone-viz 
 > > tool to git (http://oandrieu.nerim.net/monotone-viz/ - use the one 
 > > from the monotone repository). The tool visualizes the history flow 
 > > nicely; see
 > > 
 > > 	http://rover.dkm.cz/~pasky/gitviz1.png
 > > 	http://rover.dkm.cz/~pasky/gitviz2.png
 > > 	http://rover.dkm.cz/~pasky/gitviz3.png
 > > 	http://rover.dkm.cz/~pasky/gitviz4.png
 > > 	http://rover.dkm.cz/~pasky/gitviz5.png
 > > 	http://rover.dkm.cz/~pasky/gitviz6.png
 > > 	http://rover.dkm.cz/~pasky/gitviz7.png
 > > 
 > > for some screenshots.
 > 
 > really nice stuff! Any plans to include it in git-pasky, via 'git gui' 
 > option or so? Also, which particular version has this included - the 
 > freshest tarball on the monotone-viz download site doesnt seem to 
 > include it.

I'll post a tarball soon. You can also get it from the monotone
repository, but I wouldn't recommend it unless you want to try
monotone as well : that involves a rather large download.

-- 
   Olivier

^ permalink raw reply

* [PATCH] gittrack.sh accepts invalid branch names
From: Pavel Roskin @ 2005-04-20 19:48 UTC (permalink / raw)
  To: git, Petr Baudis

Hello, Petr and everybody!

gittrack.sh allows abbreviated branch names, e.g. it's possible to run
"git track lin" when there is a branch called "linus".

I believe it's a bug, not a feature.  Please look at this line from
gittrack.sh:

grep -q $(echo -e "^$name\t" | sed 's/\./\\./g') .git/remotes

The result of command expansion is subjected to word splitting, which
means the trailing tab is removed as a space.  So grep doesn't see the
tab.

The way to avoid word splitting would be to quote "$()", but it would
make the shell code too hairy.  I'm not even sure all shells would
interpret "$("$name")" correctly.

So I decided to use tab directly in the sed expression.  I cannot think
of any portable way to avoid grep completely ("q" is a GNU sed
extension, and we want to support BSD, I think), so it's still there,
looking for any output from sed.

Signed-off-by: Pavel Roskin <proski@gnu.org>

--- a/gittrack.sh
+++ b/gittrack.sh
@@ -35,7 +35,7 @@ die () {
 mkdir -p .git/heads
 
 if [ "$name" ]; then
-	grep -q $(echo -e "^$name\t" | sed 's/\./\\./g') .git/remotes || \
+	sed -ne "/^$name\t/p" .git/remotes | grep -q . || \
 		[ -s ".git/heads/$name" ] || \
 		die "unknown branch \"$name\""
 

-- 
Regards,
Pavel Roskin


^ permalink raw reply

* Re: [PATCH] write-tree performance problems
From: Linus Torvalds @ 2005-04-20 19:47 UTC (permalink / raw)
  To: Chris Mason; +Cc: git
In-Reply-To: <Pine.LNX.4.58.0504201218360.6467@ppc970.osdl.org>



On Wed, 20 Apr 2005, Linus Torvalds wrote:
>
> It would be nicer for the cache to make the index file "header" be a 
> "footer", and write it out last - that way we'd be able to do the SHA1 as 
> we write rather than doing a two-pass thing. That's for another time.

That other time was now.

The header is still a header, but the sha1 is now at the end of the file, 
which means that the header version has been incremented by 1 (to 2).

This is also sadly an incompatible change, so once you update and install
the new tools, you'll need to do

	tree=$(cat-file commit $(cat .git/HEAD) | sed 's/tree //;q')
	read-tree $tree
	update-cache --refresh

to re-build your index file.

Sorry about that, but the end result should be quite fast (especially if
your sha1 is fast). The best benchmark is probably to just do a "time
update-cache Makefile" in the kernel (before and after), when the cache
was already up-to-date and with no time spent on stating lots of files.  
That kind of "one file changed" timing is actually the common case (in
this case Makefile won't have changed, but update-cache doesn't care).

(Of course, I could optimize it to notice that the update-cache didn't do
anything and avoid the write altogether, but that's likely optimizing for
the wrong case, since normally you'd call update-cache when you know
something changed).

Yeah, it's somewhat silly doing optimizations at this point, but I want to
make sure that the data structures are all ready for a real release, and
as part of that I want to make sure there are no stupid low-hanging fruit
that we'll curse later. Better get it done with now.

			Linus

^ permalink raw reply

* Re: git-viz tool for visualising commit trees
From: Petr Baudis @ 2005-04-20 19:45 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: git, oliv__a
In-Reply-To: <20050420100824.GB25477@elte.hu>

Dear diary, on Wed, Apr 20, 2005 at 12:08:24PM CEST, I got a letter
where Ingo Molnar <mingo@elte.hu> told me that...
> * Petr Baudis <pasky@ucw.cz> wrote:
> >   just FYI, Olivier Andrieu was kind enough to port his monotone-viz 
> > tool to git (http://oandrieu.nerim.net/monotone-viz/ - use the one 
> > from the monotone repository). The tool visualizes the history flow 
> > nicely; see
> > for some screenshots.
> 
> really nice stuff! Any plans to include it in git-pasky, via 'git gui' 
> option or so? Also, which particular version has this included - the 
> freshest tarball on the monotone-viz download site doesnt seem to 
> include it.

AFAIK you need Monotone and grab it from the monotone repository.

git gui sounds interesting, but perhaps in longer horizon, and perhaps
not as an integral part of git-pasky. I don't know ocaml and it's rather
large thing.

Point'n'drag merges, anyone? ;-))

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor

^ permalink raw reply

* Re: [PATCH] write-tree performance problems
From: Linus Torvalds @ 2005-04-20 19:19 UTC (permalink / raw)
  To: Chris Mason; +Cc: git
In-Reply-To: <200504201504.59541.mason@suse.com>



On Wed, 20 Apr 2005, Chris Mason wrote:
> 
> Well, the difference there should be pretty hard to see with any benchmark.
> But I was being lazy...new patch attached.  This one gets the same perf 
> numbers, if this is still wrong then I really need some more coffee.

I did my preferred version. Makes a big difference here too.

It would be nicer for the cache to make the index file "header" be a 
"footer", and write it out last - that way we'd be able to do the SHA1 as 
we write rather than doing a two-pass thing. That's for another time.

		Linus

^ permalink raw reply

* Re: [PATCH] write-tree performance problems
From: Chris Mason @ 2005-04-20 19:04 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git
In-Reply-To: <Pine.LNX.4.58.0504201040400.6467@ppc970.osdl.org>

[-- Attachment #1: Type: text/plain, Size: 955 bytes --]

On Wednesday 20 April 2005 13:52, Linus Torvalds wrote:
> On Wed, 20 Apr 2005, Chris Mason wrote:
> > The patch below with your current tree brings my 100 patch test down to
> > 22 seconds again.
>
> If you ever have a cache_entry bigger than 16384, your code will write
> things out in the wrong order (write the new cache without flushing the
> old buffer).

Whoops

> Finally, if you really want to go fast, you should really try to make your
> writes powers-of-two, ie fill up the buffer entirely rather than saying
> "if I were to overflow, flush it now". It doesn't matter that much for
> some filesystems (especially local and append-only like the patterns are
> here), but it can definitely matter for the stupid ones.

Well, the difference there should be pretty hard to see with any benchmark.
But I was being lazy...new patch attached.  This one gets the same perf 
numbers, if this is still wrong then I really need some more coffee.

-chris


[-- Attachment #2: read-cache-fast.diff --]
[-- Type: text/x-diff, Size: 1410 bytes --]

--- linus.back/read-cache.c	2005-04-20 10:14:23.268310000 -0400
+++ linus/read-cache.c	2005-04-20 14:54:28.554518320 -0400
@@ -232,11 +232,13 @@
 	SHA_CTX c;
 	struct cache_header hdr;
 	int i;
+	#define BUFLEN 16384
+	static char buf[BUFLEN];
+	int len = 0;
 
 	hdr.hdr_signature = htonl(CACHE_SIGNATURE);
 	hdr.hdr_version = htonl(1);
 	hdr.hdr_entries = htonl(entries);
-
 	SHA1_Init(&c);
 	SHA1_Update(&c, &hdr, offsetof(struct cache_header, sha1));
 	for (i = 0; i < entries; i++) {
@@ -246,13 +248,37 @@
 	}
 	SHA1_Final(hdr.sha1, &c);
 
-	if (write(newfd, &hdr, sizeof(hdr)) != sizeof(hdr))
-		return -1;
-
+	/* hdr is small right now, but just
+	 * in case someone changes that...
+	 */
+	if (sizeof(hdr) < BUFLEN) {
+		memcpy(buf, &hdr, sizeof(hdr));
+		len += sizeof(hdr);
+	} else {
+		if (write(newfd, &hdr, sizeof(hdr)) != sizeof(hdr))
+			return -1;
+	}
 	for (i = 0; i < entries; i++) {
 		struct cache_entry *ce = cache[i];
 		int size = ce_size(ce);
-		if (write(newfd, ce, size) != size)
+		char *p = (char *)ce;
+		while(size > 0) {
+			int count = size;
+			if (count > BUFLEN - len)
+				count = BUFLEN - len;
+			memcpy(buf + len, p, count);
+			size -= count;
+			len += count;
+			p += count;
+			if (len == BUFLEN) {
+				if (write(newfd, buf, len) != len)
+					return -1;
+				len = 0;
+			}
+		}
+	}
+	if (len) {
+		if (write(newfd, buf, len) != len)
 			return -1;
 	}
 	return 0;

^ permalink raw reply

* Re: SHA1 hash safety
From: David Meybohm @ 2005-04-20 18:56 UTC (permalink / raw)
  To: C. Scott Ananian; +Cc: Andy Isaacson, omb, git
In-Reply-To: <Pine.LNX.4.61.0504191848300.29929@cag.csail.mit.edu>

On Tue, Apr 19, 2005 at 06:48:57PM -0400, C. Scott Ananian wrote:
> On Tue, 19 Apr 2005, David Meybohm wrote:
> 
> >But doesn't this require assuming the distribution of MD5 is uniform,
> >and don't the papers finding collisions in less show it's not? So, your
> >birthday-argument for calculating the probability wouldn't apply, because
> >it rests on the assumption MD5 is uniform, and it isn't.
> 
> No, the collision papers don't show this at all.

I didn't mean they showed it directly. I meant by finding collisions in
MD5 quickly, MD5 would have to have some non-uniformity. But that's
nevertheless wrong because uniformness and collision finding ability
aren't related. Sorry to have wasted everyone's time.

Dave

^ permalink raw reply

* Re: [PATCH] write-tree performance problems
From: David S. Miller @ 2005-04-20 18:07 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: mason, git
In-Reply-To: <Pine.LNX.4.58.0504200957030.6467@ppc970.osdl.org>

On Wed, 20 Apr 2005 10:06:15 -0700 (PDT)
Linus Torvalds <torvalds@osdl.org> wrote:

> I bet your SHA1 implementation is done with hand-optimized and scheduled
> x86 MMX code or something, while my poor G5 is probably using some slow
> generic routine. As a result, it only improved by 33% for me since the
> compression was just part of the picture, but with your cheap SHA1 the
> compression costs really dominated, and so it's almost four times faster
> for you.

The openssl tree has a i586 optimized SHA1 implementation.
A quick scan of the 0.9.7e tree I happen to have lying around
shows there aren't optimized for other cpus in there, just i586.

^ permalink raw reply

* distributed merge prototype
From: Matt Mackall @ 2005-04-20 17:50 UTC (permalink / raw)
  To: git

I've hacked together a prototype SCM that I think you folks might be
interested in. The announcement is here:

 http://selenic.com/mercurial/announce.txt

It's at a very early stage right now and is likely to break if you
look at it wrong, but I have sucessfully managed to check in kernel
trees, do a local clone/branch, make changes in both trees, and then
do a pull/sync which called up tkdiff where appropriate.

I mention it here because I've got a fairly simple implementation of
distributed merge ala Monotone or BK with the necessary graph smarts.
It also should perform decently - I've paid a lot of attention to the
scalability of the core algorithms. The core of the merge code is less
than 100 lines so even people who aren't familiar with Python may be
able to wrap their head around it and leverage it for git.

I'd also like to encourage more attention to back-end storage.
Mercurial can check in all 495 versions of linux/Makefile from bkcvs
to compressed delta store in about 5 seconds on my laptop and the
result is about 80K (bkcvs takes 254K). Adding and retrieving
revisions is O(1).

The same directory individually compressed by gzip (ie what git does)
takes a comparable amount of time and 5.1M of disk space. This is
admittedly a worst case for git as most of the deltas are small, but I
needed a test file with lots of revisions.

Now back to our regularly scheduled programming..

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply

* Re: [PATCH] write-tree performance problems
From: Linus Torvalds @ 2005-04-20 17:52 UTC (permalink / raw)
  To: Chris Mason; +Cc: git
In-Reply-To: <200504201323.05447.mason@suse.com>



On Wed, 20 Apr 2005, Chris Mason wrote:
> 
> The patch below with your current tree brings my 100 patch test down to 22 
> seconds again.

If you ever have a cache_entry bigger than 16384, your code will write 
things out in the wrong order (write the new cache without flushing the 
old buffer).

You also don't free the buffer.

Finally, if you really want to go fast, you should really try to make your
writes powers-of-two, ie fill up the buffer entirely rather than saying
"if I were to overflow, flush it now". It doesn't matter that much for
some filesystems (especially local and append-only like the patterns are
here), but it can definitely matter for the stupid ones.

But yeah, we could obviously chunk things out properly. You might want to 
just use stdio and "fwrite()", though, which does all of that for you, and 
hopefully does it right.

(I'm not a big fan of stdio for something like this, so if you want to 
create a little helper function that just does the chunking, go wild. 
Something like

	#define BUFSIZ 8192
	static char buffer[BUFSIZ];
	static unsigned long buflen;

	int ce_write(int fd, void *data, unsigned int len)
	{
		while (len) {
			unsigned int buffered = buflen;
			unsigned int partial = BUFSIZ - buflen;
			if (partial > len)
				partial = len;
			memcpy(buffer + buflen, data, partial);
			buffered += partial;
			if (buffered == BUFSIZ) {
				if (write(fd, buffer, BUFSIZ) != BUFSIZ)
					die("unable to write");
				buffered = 0;
			}
			buflen = buffered;
			len -= partial;
			data += partial;
		}
	}

	int ce_flush(int fd)
	{
		unsigned int left = buflen;
		if (left) {
			buflen = 0;
			if (write(fd, buffer, left) != left)
				die("unable to write");
		}
	}

which should be ok, and cheesily avoids the allocation overhread issues by
just having a nice static buffer.

"If you want to go fast, do it right".

Untested, as usual.

		Linus

^ permalink raw reply

* Re: [PATCH] Some documentation...
From: David Greaves @ 2005-04-20 17:35 UTC (permalink / raw)
  To: C. Scott Ananian; +Cc: git
In-Reply-To: <Pine.LNX.4.61.0504201321380.2630@cag.csail.mit.edu>

C. Scott Ananian wrote:
> On Wed, 20 Apr 2005, David Greaves wrote:
> 
>> In doing this I noticed a couple of points:
>> * update-cache won't accept ./file or fred/./file
> 
> 
> The comment in update-cache.c reads:
> /*
>  * We fundamentally don't like some paths: we don't want
>  * dot or dot-dot anywhere, and in fact, we don't even want
>  * any other dot-files (.git or anything else). They
>  * are hidden, for chist sake.
>  *
>  * Also, we don't want double slashes or slashes at the
>  * end that can make pathnames ambiguous.
>  */
> 
> It could be argued that './' is a special case... but at the moment this 
> is definitely a designed 'feature' not a 'bug'.

Indeed - I've been reading the code to document it as correctly as possible.

But I actually found this by running:

   find . -type f | xargs git add

for a new project - so I'd class it as user unfriendly...
Yes, I know how to get round it :)

I have ensured that my next perl version of gitadd.pl (that I submitted 
to Petr) doesn't allow these files to be added - and it could even 
cleanse leading ./ and any /./ constructs.

So maybe it's left as documented behaviour and higher level tools must 
manage the data they feed to it...

I hope it's useful to raise these niggles now before changing them is 
too hard.

David

-- 

^ permalink raw reply

* Blob chunking code. [Second look]
From: C. Scott Ananian @ 2005-04-20 17:31 UTC (permalink / raw)
  To: Git Mailing List
In-Reply-To: <Pine.LNX.4.61.0504200917070.28851@cag.csail.mit.edu>

Here's a quick rev of the chunking code.  This is compatible with 
git-current, where the hashes are of the *uncompressed* file.
The 'chunk' file gets dropped in at the same SHA1 filename as the
'blob' file, as it represents identical contents.  Martin won't like
this (because of how the hash is computed), but this is the short-term
direction I want to pursue to validate the concept: it means I can
run a simple converter over all the blob objects and don't have to
rewrite tree and commit objects.

If the approach is seen to have merit, then we can perhaps think about 
doing another bulk repository format conversion where all the hashes
change.  But (IMO) it's a little early to be thinking of this yet.
  --scott

nuclear RUCKUS KUPALM ODACID LA STANDEL Mossad LITEMPO atomic mail drop 
Hussein JUBILIST class struggle SSBN 731 Bush quiche Nazi MKULTRA
                          ( http://cscott.net/ )
---------  chunk.c ----------
/*
  * This file implements a treap-based chunked content store.  The
  * idea is that every stored file is broken down into tree-structured
  * chunks (that is, every chunk has an optional 'prefix' and 'suffix'
  * chunk), and these chunks are put in the object store.  This way
  * similar files will be expected to share chunks, saving space.
  * Files less than one disk block long are expected to fit in a single
  * chunk, so there is no extra indirection overhead for this case.
  *
  * Copyright (C) 2005 C. Scott Ananian <cananian@alumni.princeton.edu>
  */

/*
  * We assume that the file and the chunk information all fits in memory.
  * A slightly more-clever implementation would work even if the file
  * didn't fit.  Basically, we could scan it an keep the
  * 'N' lowest heap keys (chunk hashes), where 'N' is chosen to fit
  * comfortably in memory.  These would form the root and top
  * of the resulting treap, constructing it top-down.  Then we'd scan
  * again any only keep the next 'N' lowest heap keys, etc.
  *
  * But we're going to keep things simple.  We do try to maintain locality
  * where possible, so if you need to swap things still shouldn't be too bad.
  */

#include <assert.h>
#include <stdlib.h>
#include "cache.h"
#include "chunk.h"

typedef unsigned long ch_size_t;

/* Our magic numbers: these can be tuned without breaking files already
  * in the archive, although space re-use is only expected between files which
  * have these constants set to the same values. */

/* The window size determines how much context we use when looking for a
  * chunk boundary.
  * C source has approx 5 bits per character of entropy.
  * We'd like to get 32 bits of good entropy into our boundary checksum;
  * that means 7 bytes is a rough minimum for the window size.
  * 30 bytes is what 'rsyncable zlib' uses; that should be fine. */
#define ROLLING_WINDOW 30
/* The ideal chunk size will fit most chunks into a disk block.  A typical
  * disk block size is 4k, and we expect (say) 50% compression. */
#define CHUNK_SIZE 7901 /* primes are nice to use */

/* Data structures: */
struct chunk {
     /* a chunk represents some range of the underlying file */
     ch_size_t start /* inclusive */, end /*exclusive*/;
     unsigned char sha1[20]; /* sha1 for this chunk; used as the heap key */
};
struct chunklist {
     /* a dynamically-sized list of chunks */
     struct chunk *chunk; /* an array of chunks */
     ch_size_t num_items; /* how many items are currently in the list */
     ch_size_t allocd;    /* how many items we've allocated space for */
};
struct treap {
     /* A treap node represents a run of consecutive chunks. */

     /* the start and end of the run: */
     ch_size_t start /* inclusive */, end /*exclusive*/;
     struct chunk *chunk; /* some chunk in the run. */
     /* treaps representing the run before 'chunk' (left) and
      * after 'chunk' (right).  */
     struct treap *left, *right;
     /* sha1 for the run represented by this treap */
     unsigned char sha1[20];
};

static struct chunklist *
create_chunklist(int expected_items) {
     struct chunklist *cl = malloc(sizeof(*cl));
     cl->num_items = 0;
     cl->allocd = expected_items;
     cl->chunk = malloc(sizeof(cl->chunk[0]) * cl->allocd);
     return cl;
}
static void
free_chunklist(struct chunklist *cl) {
     free(cl->chunk);
     free(cl);
}

/* Add a chunk to the chunk list, calculating its SHA1 in the process. */
/* The chunk includes buf[start] to buf[end-1].                        */
static void
add_chunk(struct chunklist *cl, char *buf, ch_size_t start, ch_size_t end) {
     struct chunk *ch;
     SHA_CTX c;
     assert(start<end); assert(cl); assert(buf);
     if (cl->num_items >= cl->allocd) {
 	cl->allocd = cl->allocd*3/2;
 	cl->chunk = realloc(cl->chunk, cl->allocd * sizeof(*(cl->chunk)));
     }
     assert(cl->num_items < cl->allocd);
     ch = cl->chunk + (cl->num_items++);
     ch->start = start;
     ch->end = end;
     /* compute SHA-1 of the chunk. */
     SHA1_Init(&c);
     SHA1_Update(&c, buf+start, end-start);
     SHA1_Final(ch->sha1, &c);
     /* done! */
}

/* Split a buffer into chunks, using a rolling checksum over ROLLING_WINDOW
  * bytes to determine chunk boundaries.  We try to split chunks into pieces
  * whose size averages out to be 'CHUNK_SIZE'. */
static void
chunkify(struct chunklist *cl, char *buf, ch_size_t size) {
     int i, rsync_s1=0, rsync_s2=0, last=-1;
     /* While window is filling: */
     for (i=0; i<ROLLING_WINDOW && i<size; i++) {
 	/* add one to char so that leading 0s don't behave strangely. */
 	rsync_s1 = (rsync_s1 + (1 + (unsigned char)buf[i])) & 0xFFFF;
 	rsync_s2 = (rsync_s2 + rsync_s1) & 0xFFFF;
 	/* Is this the end of a chunk? */
 	if (0 == ((rsync_s1 + rsync_s2) % CHUNK_SIZE)) {
 	    add_chunk(cl, buf, last+1, i+1);
 	    last = i;
 	}
     }
     /* After window is full: */
     for ( ; i<size; i++) {
 	/* Old character out */
 	rsync_s1 = (rsync_s1 - (1 + (unsigned char)buf[i-ROLLING_WINDOW])) & 0xFFFF;
 	rsync_s2 = (rsync_s2 - ROLLING_WINDOW * (1 + (unsigned char)buf[i-ROLLING_WINDOW])) & 0xFFFF;
 	/* New character in */
 	rsync_s1 = (rsync_s1 + (1 + (unsigned char)buf[i])) & 0xFFFF;
 	rsync_s2 = (rsync_s2 + rsync_s1) & 0xFFFF;
 	/* Is this the end of a chunk? */
 	if (0 == ((rsync_s1 + rsync_s2) % CHUNK_SIZE)) {
 	    add_chunk(cl, buf, last+1, i+1);
 	    last = i;
 	}
     }
     /* One last chunk at the end: */
     if (last+1!=size)
 	add_chunk(cl, buf, last+1, size);
     /* done! */
}

/* A treap is a 'heap-ordered tree'.  There are two constraints maintained:
  *   left tree key < this tree key < right tree key
  * and
  *   this heap key < left and right heap keys.
  * We use the sha1 of the chunk (chunk->sha1) as the heap key and the
  * file location (chunk->start) as the tree key.
  * For more info on treaps, see:
  *   C. R. Aragon and R. G. Seidel, "Randomized search trees",
  *   Proc. 30th IEEE FOCS (1989), 540-545.
  * There are many possible binary trees we could build; enforcing the
  * heap constraint ensures that similar files will build similar trees.
  * (The root of the constructed tree will always be the chunk with the
  *  smallest hash key; it's left child will be the chunk with the smallest
  *  hash among those chunk before the root in file order; and so on
  *  recursively.)
  */

/* Assertion helper: check tree and heap constraints. */
static int
treap_valid(struct treap *t) {
     int valid = 1;
     if (!t) return 1;
     if (t->chunk==NULL) return 0;
     if (t->left!=NULL) {
 	/* Tree constraint. */
 	valid = valid && (t->left->chunk->start < t->chunk->start);
 	/* Heap constraint. */
 	valid = valid && (memcmp(t->chunk->sha1, t->left->chunk->sha1,
 				 sizeof(t->chunk->sha1)) < 0);
 	/* 'start' validity */
 	valid = valid && (t->start == t->left->start);
     } else
 	valid = valid && (t->start == t->chunk->start);
     if (t->right!=NULL) {
 	/* Tree constraint. */
 	valid = valid && (t->chunk->start < t->right->chunk->start);
 	/* Heap constraint. */
 	valid = valid && (memcmp(t->chunk->sha1, t->right->chunk->sha1,
 				 sizeof(t->chunk->sha1)) < 0);
 	/* 'end' validity. */
 	valid = valid && (t->end == t->right->end);
     } else
 	valid = valid && (t->end == t->chunk->end);
     return valid;
}

/* Restore heap constraint without disturbing tree ordering. */
/* Only the root of the given treap will violate the heap constraint. */
static struct treap *
treapify(struct treap *t) {
     struct treap *x, *y, *a, *b, *c;
     int left_ok, right_ok, rotate_left;
     assert(treap_valid(t->left));
     assert(treap_valid(t->right));
     left_ok = (t->left == NULL) ||
 	(memcmp(t->chunk->sha1, t->left->chunk->sha1,
 		sizeof(t->chunk->sha1)) < 0);
     right_ok = (t->right == NULL) ||
 	(memcmp(t->chunk->sha1, t->right->chunk->sha1,
 		sizeof(t->chunk->sha1)) < 0);
     if (left_ok && right_ok) { /* well, that's easy */
 	assert(treap_valid(t));
 	return t;
     }
     /* okay, someone needs to rotate */
     rotate_left = (!left_ok) &&
 	(right_ok || /* if neither is okay, then rotate smallest up */
 	 memcmp(t->left->chunk->sha1, t->right->chunk->sha1,
 		sizeof(t->chunk->sha1)) < 0);
     /*   Rotation:
      *     y   -bring left up->  x
      *    / \                   / \
      *   x   c                 a   y
      *  / \                       / \
      * a   b <-bring right up-   b   c
      */
     if (rotate_left) {
 	y = t;  x = y->left;  c = y->right;  a = x->left;  b = x->right;
 	y->left = b;
 	y->right = c;
 	y->start = y->left ? y->left->start : y->chunk->start;
 	y->end = y->right ? y->right->end : y->chunk->end;
 	x->left = a;
 	x->right = treapify(y); // recurse to check heap constraint
 	x->start = x->left ? x->left->start : x->chunk->start;
 	x->end = x->right ? x->right->end : x->chunk->end;
 	assert(treap_valid(x));
 	return x;
     } else {
 	x = t;  a = x->left;  y = x->right;  b = y->left;  c = y->right;
 	x->left = a;
 	x->right = b;
 	x->start = x->left ? x->left->start : x->chunk->start;
 	x->end = x->right ? x->right->end : x->chunk->end;
 	y->right = c;
 	y->left = treapify(x); // recurse to check heap constraint.
 	y->start = y->left ? y->left->start : y->chunk->start;
 	y->end = y->right ? y->right->end : y->chunk->end;
 	assert(treap_valid(y));
 	return y;
     }
}

/* Use list of chunks to build treap bottom-up, calling treapify to
  * restore heap order on the subtree after we add each interior node.
  * This is O(N), where N is the number of chunks. */
static struct treap *
build_treap(struct chunklist *cl, int chunk_st, int chunk_end) {
     struct treap *result;
     /* Some treaps are trivial to build: */
     if (chunk_st >= chunk_end) return NULL;
     /* Claim a chunk in the middle for ourself. */
     int c = (chunk_st + chunk_end)/2;
     result = (struct treap *)malloc(sizeof(*result));
     result->chunk = &(cl->chunk[c]);
     /* Divide and conquer: build well-formed treaps for our kids.*/
     result->left = build_treap(cl, chunk_st, c);
     result->right = build_treap(cl, c+1, chunk_end);
     result->start = result->left ? result->left->start : result->chunk->start;
     result->end = result->right ? result->right->end : result->chunk->end;
     /* Now we need to ensure that the heap constraint is satisfied; that is,
      * result->chunk->sha1 < result->left->chunk->sha1  and
      * result->chunk->sha1 < result->right->chunk->sha1.
      */
     assert(treap_valid(result->left));
     assert(treap_valid(result->right));
     return treapify(result);
}

static void
free_treap(struct treap *t) {
     if (!t) return;
     free_treap(t->left);
     free_treap(t->right);
     free(t);
}

static int
treap_depth(struct treap *t) {
     int l, r;
     if (!t) return 0;
     l = treap_depth(t->left);
     r = treap_depth(t->right);
     return 1 + ((l > r) ? l : r);
}

/* Fill in the treap hashes.  This will be O(N ln M), where N is the
  * file length and M is the number of chunks.  We could actually do
  * this in 2*N time if the subtree hashes were prefix-identical.
  * Since we need to include the chunk length in the hash prefix,
  * we can't reuse the hashing context and we need to pay the extra
  * O(ln M) factor. */
static void
do_treap_hash(struct treap *t, void *data, SHA_CTX *accum, int accum_len) {
     char prefix[200];
     SHA_CTX *cp;
     int i;

     assert(treap_valid(t));
     if (!t) return;

     /* Start a new treap context. */
     cp = &(accum[accum_len++]);
     SHA1_Init(cp);
     /* Sticking the size in the prefix makes me unhappy. =( */
     SHA1_Update(cp, prefix, 1+sprintf(prefix, "blob %lu", t->end - t->start));
     /* Recurse on the left. */
     do_treap_hash(t->left, data, accum, accum_len);
     /* Add in our chunk. */
     for (i=0; i<accum_len; i++)
 	SHA1_Update(accum + i, data + t->chunk->start,
 		    t->chunk->end - t->chunk->start);
     /* Recurse on the right. */
     do_treap_hash(t->right, data, accum, accum_len);
     /* Finalize and write it to t->sha1. */
     SHA1_Final(t->sha1, cp);
     /* Done! */
}
/* Helper method. */
static void
compute_treap_hashes(struct treap *t, void *data) {
     /* Allocate space for each level of the treap to have its own context. */
     SHA_CTX contexts[treap_depth(t)];
     do_treap_hash(t, data, contexts, 0);
}
/* Yuck. */
static const char *
compute_null_treap_hash() {
     static const char fixed[] = { "blob 0" };
     static char sha1[20], *cp=NULL;
     SHA_CTX c;
     if (cp) return cp;
     SHA1_Init(&c);
     SHA1_Update(&c, fixed, sizeof(fixed));
     SHA1_Final(sha1, &c);
     cp = sha1;
     return cp;
}


/* Now that we've broken it down into treap-structured pieces, let's write
  * them to the object store. */

/* Write a single treap piece to the object store.  Note that 't' may be
  * NULL for the special case of a zero-byte file.  Writes the hash of
  * this piece back to 'sha1', which must be non-NULL. Returns 0 on success.*/
static int
write_one(struct treap *t, char *buf) {
/* two hundred bytes is two 20-byte SHA1 hashes, two presence bytes,
  * six bytes of type, one null, and plus 10^151 file length. (Conservative.) */
#define MAX_METADATA_LEN 200
     z_stream stream;
     ch_size_t max_out_bytes;
     ch_size_t chunk_size = t ? (t->chunk->end - t->chunk->start) : 0;
     ch_size_t content_size, metadata_size;
     char metadata[MAX_METADATA_LEN];
     void *out;

     /*
      * Metadata: Type, ASCII size, null byte, then left & right hashes.
      */
     content_size = chunk_size+2; /* prefix/suffix delimiters */
     if (t && t->left) content_size += sizeof(t->left->sha1);
     if (t && t->right) content_size += sizeof(t->right->sha1);
     metadata_size =  1+sprintf(metadata, "chunk %lu", content_size);
     if (t && t->left) { /* left hash */
 	metadata[metadata_size++] = 1;
 	memcpy(metadata + metadata_size, t->left->sha1, sizeof(t->left->sha1));
 	metadata_size += sizeof(t->left->sha1);
     } else
 	metadata[metadata_size++] = 0; /* no prefix chunk */
     if (t && t->right) { /* right hash */
 	metadata[metadata_size++] = 1;
 	memcpy(metadata + metadata_size,t->right->sha1,sizeof(t->right->sha1));
 	metadata_size += sizeof(t->right->sha1);
     } else
 	metadata[metadata_size++] = 0; /* no suffix chunk */

     memset(&stream, 0, sizeof(stream));
     deflateInit(&stream, Z_BEST_COMPRESSION);
     max_out_bytes = deflateBound(&stream, chunk_size+metadata_size);
     out = malloc(max_out_bytes);
     stream.next_out = out;
     stream.avail_out = max_out_bytes;

     /* Compress metadata. */
     stream.next_in = metadata;
     stream.avail_in = metadata_size;
     while (deflate(&stream, 0) == Z_OK)
 	    /* nothing */;

     /*
      * Chunk content.
      */
     stream.next_in = buf + ( t ? t->chunk->start : 0);
     stream.avail_in = chunk_size; /* possibly zero */
     while (deflate(&stream, Z_FINISH) == Z_OK)
 	/* nothing */;

     deflateEnd(&stream);

     return write_sha1_buffer(t ? (const char*) t->sha1 :
 			     compute_null_treap_hash(),
 			     out, stream.total_out);
}

/* Write all treap nodes to disk. */
static int
write_treap(struct treap *t, char *buf, char *sha1) {
     /* First write children (which initializes their SHA1 info). */
     if (t && t->left)
 	if (write_treap(t->left, buf, NULL) < 0)
 	    return -1; /* failure. */
     if (t && t->right)
 	if (write_treap(t->right, buf, NULL) < 0)
 	    return -1; /* failure. */
     /* Now write us.  Note t may == NULL for a zero-byte file. */
     if (write_one(t, buf) < 0)
 	return -1; /* failure. */
     /* Write back sha1, if wanted. */
     if (sha1)
 	memcpy(sha1, t ? (const char*)t->sha1 : compute_null_treap_hash(),
 	       sizeof(t->sha1));
     return 0;
}

/* EXPORTED FUNCTION: write the file open on file descriptor 'fd'
  * and described by 'ce' and 'st' to the object store.   Return
  * 0 on success, -1 on failure. */
/* This does the same thing as 'index_fd' in Linus' update-cache.c */
int
chunk_index_fd(struct cache_entry *ce, int fd, struct stat *st) {
     struct chunklist *cl;
     struct treap *t;
     char *in;

     /* We expect there to be 'file length / CHUNK_SIZE' chunks.  Over-estimate
      * a little, and do the initial chunk list allocation. */
     cl = create_chunklist(1 + ((3 * st->st_size) / (2 * CHUNK_SIZE)));
     /* Split the file into chunks. */
     in = "";
     if (st->st_size)
 	in = mmap(NULL, st->st_size, PROT_READ, MAP_PRIVATE, fd, 0);
     close(fd);
     if (in==MAP_FAILED) return -1;

     chunkify(cl, in, st->st_size);
     /* Build the treap. */
     t = build_treap(cl, 0, cl->num_items);
     assert(treap_valid(t));
     /* Compute all the hashes. */
     compute_treap_hashes(t, in);
     /* Now write all the pieces, updating SHA1 for this file in the process. */
     if (write_treap(t, in, ce->sha1) < 0)
 	return -1;
     /* Free everything; we're done. */
     free_treap(t);
     free_chunklist(cl);
     if (st->st_size)
 	munmap(in, st->st_size);
     return 0; /* success! */
}

/*** Functions to read a chunked file into a contiguous buffer. ***/

struct read_chunk {
     void *data, *chunk_data;
     ch_size_t chunk_size, total_size;
     struct read_chunk *left, *right;
};
static struct read_chunk *
read_chunk2(const unsigned char *sha1, void *data, ch_size_t size);

static struct read_chunk *
read_chunk(const unsigned char *sha1) {
     void *data;
     ch_size_t size;
     char type[10];
     data = read_sha1_file(sha1, type, &size);
     assert(strcmp(type, "chunk")==0);
     return read_chunk2(sha1, data, size); 
}
static struct read_chunk *
read_chunk2(const unsigned char *sha1, void *data, ch_size_t size) {
     unsigned char *cp;
     struct read_chunk *result = malloc(sizeof(*result));
     cp = result->data = data;
     printf("CHUNK %s (%lu bytes)\n", sha1_to_hex(sha1), size);
     /* Parse the chunk data. */
     result->left = result->right = NULL;
     if (*cp++) {
 	result->left = read_chunk(cp); cp+=20;
     }
     if (*cp++) {
 	result->right = read_chunk(cp); cp+=20;
     }
     result->chunk_data = cp;
     result->chunk_size = size - (result->chunk_data - result->data);
     result->total_size = result->chunk_size +
 	(result->left ? result->left->total_size : 0) +
 	(result->right ? result->right->total_size : 0);
     return result;
}
static void
copy_read_chunk(void *dest, struct read_chunk *rc) {
     if (rc->left) {
 	copy_read_chunk(dest, rc->left);
 	dest += rc->left->total_size;
     }
     memcpy(dest, rc->chunk_data, rc->chunk_size);
     if (rc->right)
 	copy_read_chunk(dest + rc->chunk_size, rc->right);
}
static void
free_read_chunk(struct read_chunk *rc) {
     if (rc->left) free_read_chunk(rc->left);
     if (rc->right) free_read_chunk(rc->right);
     free(rc->data);
     free(rc);
}

/* This does the same thing as 'read_sha1_file' in Linus' read_cache.c,
  * except that it knows about the 'chunk' encoding and will transparently
  * stitch together the appropriate prefix and suffix chunks and pass it
  * off as a 'blob'. */
void *
chunk_read_sha1_file(const unsigned char *sha1, char *type, unsigned long *size) {
     struct read_chunk *rc;
     void *result = read_sha1_file(sha1, type, size);
     if (strcmp(type, "chunk")!=0) return result;
     /* This is a 'chunk' object; get the rest of the pieces. */
     rc = read_chunk2(sha1, result, *size);
     /* Now concatenate them together. */
     strcpy(type, "blob");
     *size = rc->total_size;
     result = malloc(*size);
     copy_read_chunk(result, rc);
     /* done! */
     free_read_chunk(rc);
     return result;
}

#if 0
/* Exercise this code. */
int main(int argc, char **argv) {
     struct cache_entry ce;
     struct stat st;
     char *buf, type[10];
     unsigned long size;
     int fd;
     fd = open(argv[1], O_RDONLY);
     if (fd < 0) exit(1);
     if (fstat(fd, &st) < 0) exit(1);
     if (chunk_index_fd(&ce, fd, &st) < 0) exit(1);
     printf("Wrote file %s.\n", sha1_to_hex(ce.sha1));
     /* seemed to work! */
     buf = chunk_read_sha1_file(ce.sha1, type, &size);
     if (!buf) exit(1);
     printf("Read file %s, of type %s (%lu bytes):\n",
 	   sha1_to_hex(ce.sha1), type, size);
     fwrite(buf, size, 1, stdout);
     /* done! */
     return 0;
}
#endif

^ permalink raw reply

* Re: [PATCH] write-tree performance problems
From: Chris Mason @ 2005-04-20 17:23 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git
In-Reply-To: <Pine.LNX.4.58.0504200957030.6467@ppc970.osdl.org>

[-- Attachment #1: Type: text/plain, Size: 1390 bytes --]

On Wednesday 20 April 2005 13:06, Linus Torvalds wrote:
> On Wed, 20 Apr 2005, Chris Mason wrote:
> > At any rate, the time for a single write-tree is pretty consistent. 
> > Before it was around .5 seconds, and with this change it goes down to
> > .128s.
>
> Oh, wow.
>
> I bet your SHA1 implementation is done with hand-optimized and scheduled
> x86 MMX code or something, while my poor G5 is probably using some slow
> generic routine. As a result, it only improved by 33% for me since the
> compression was just part of the picture, but with your cheap SHA1 the
> compression costs really dominated, and so it's almost four times faster
> for you.

Aha, I was wondering why your write-tree speeds sounded so bad...this athlon 
machine is ~2years old now.

Your comments about costs for writing the index file got me thinking, so I 
benchmarked how long the update-cache takes if we don't do the sha1 of the 
index file.  There was almost no difference at all.  update-cache currently 
takes about .152 seconds

The code to write the cache calls write() for every cache entry, writing just 
a few bytes at a time.  I changed it to collect these into a 16k buffer, 
which brings me down to .044s.  This might not help as much on ext23, since 
they are faster than reiser for tiny writes.

The patch below with your current tree brings my 100 patch test down to 22 
seconds again.

-chris

[-- Attachment #2: read-cache-fast.diff --]
[-- Type: text/x-diff, Size: 1104 bytes --]

--- linus.back/read-cache.c	2005-04-20 10:14:23.268310000 -0400
+++ linus/read-cache.c	2005-04-20 13:05:13.200083672 -0400
@@ -232,11 +232,12 @@
 	SHA_CTX c;
 	struct cache_header hdr;
 	int i;
+	char *buf;
+	int len = 0;
 
 	hdr.hdr_signature = htonl(CACHE_SIGNATURE);
 	hdr.hdr_version = htonl(1);
 	hdr.hdr_entries = htonl(entries);
-
 	SHA1_Init(&c);
 	SHA1_Update(&c, &hdr, offsetof(struct cache_header, sha1));
 	for (i = 0; i < entries; i++) {
@@ -246,13 +247,31 @@
 	}
 	SHA1_Final(hdr.sha1, &c);
 
+	buf = malloc(16384);
+	if (!buf) {
+		return -1;
+	}
 	if (write(newfd, &hdr, sizeof(hdr)) != sizeof(hdr))
 		return -1;
 
 	for (i = 0; i < entries; i++) {
 		struct cache_entry *ce = cache[i];
 		int size = ce_size(ce);
-		if (write(newfd, ce, size) != size)
+		if (size > 16384) {
+			if (write(newfd, ce, size) != size)
+				return -1;
+			continue;
+		}
+		if (len + size > 16384) {
+			if (write(newfd, buf, len) != len)
+				return -1;
+			len = 0;
+		}
+		memcpy(buf + len, ce, size);
+		len += size;
+	}
+	if (len) {
+		if (write(newfd, buf, len) != len)
 			return -1;
 	}
 	return 0;

^ permalink raw reply

* Re: [PATCH] Some documentation...
From: C. Scott Ananian @ 2005-04-20 17:24 UTC (permalink / raw)
  To: David Greaves; +Cc: git
In-Reply-To: <42668C8D.3000209@dgreaves.com>

On Wed, 20 Apr 2005, David Greaves wrote:

> In doing this I noticed a couple of points:
> * update-cache won't accept ./file or fred/./file

The comment in update-cache.c reads:
/*
  * We fundamentally don't like some paths: we don't want
  * dot or dot-dot anywhere, and in fact, we don't even want
  * any other dot-files (.git or anything else). They
  * are hidden, for chist sake.
  *
  * Also, we don't want double slashes or slashes at the
  * end that can make pathnames ambiguous.
  */

It could be argued that './' is a special case... but at the moment this 
is definitely a designed 'feature' not a 'bug'.
  --scott

BLUEBIRD SEQUIN SECANT Waihopai Honduras KUDOVE genetic KUJUMP SCRANTON 
DES AMLASH Indonesia SLINC cracking ESMERALDITE mustard Uzi KUSODA
                          ( http://cscott.net/ )

^ permalink raw reply

* Re: [script] ge: export commits as patches
From: Zlatko Calusic @ 2005-04-20 17:21 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Petr Baudis, git
In-Reply-To: <20050419134843.GA19146@elte.hu>

Ingo Molnar <mingo@elte.hu> writes:

> TREE1=$(cat-file commit 2>/dev/null $1 | head -4 | grep ^tree | cut -d' ' -f2)
                                         --------------------------------------

And to make it easier on your eyes, you can always rewrite stuff like
that (mentioned everywhere these days :)) like:

TREE1=$(cat-file commit 2>/dev/null $1 | awk '/^tree/ {print $2}'
                                         ------------------------

No, I'm definitely not trying to save some CPU cycles, CPU cycles are
cheap, eyes are expensive! :)
-- 
Zlatko

^ permalink raw reply

* RE: missing: git api, reference, user manual and mission statement
From: Andrew Timberlake-Newell @ 2005-04-20 17:15 UTC (permalink / raw)
  To: git; +Cc: 'Petr Baudis'
In-Reply-To: <20050419165809.GE12757@pasky.ji.cz>

Petr Baudis graced us with:
> Dear diary, on Tue, Apr 19, 2005 at 02:36:32PM CEST, I got a letter
> where Klaus Robert Suetterlin <robert@mpe.mpg.de> told me that...
> > 1) There is no clear (e.g. by name) distinction between ``git as done
> > by Linus'', which is a kind of content addressable database with added
> > semantics, and ``git as done by the rest of You'', which is a kind of
> > SCM on top of Linuses stuff.
> 
> There is git and git-pasky (git-pasky is superset; therefore various
> patches floating around either get to git-pasky or to both). I'm not
> sure what else do you mean.

This goes back to the question of whether to rename git-pasky to cogito.  

Perhaps the crucial question is:  will the git plumbing be used for anything
other than SCM?

If so, then it could be useful to differentiate by program name, so that we
would know whether another project was utilizing git-plumbing or git-SCM.

If not, then there is effectively only one tool and the plumbing is a
[crucial] portion thereof:  a git (SCM and the file system around which it
was built).

So what's the answer to the question?  Anyone planning to use git (the file
system) outside of the SCM?



^ permalink raw reply

* Re: [ANNOUNCEMENT] /Arch/ embraces `git'
From: duchier @ 2005-04-20 17:15 UTC (permalink / raw)
  To: Tom Lord; +Cc: gnu-arch-users, gnu-arch-dev, talli, git, torvalds
In-Reply-To: <200504201000.DAA04988@emf.net>

Hi Tom,

just as a datapoint, here is an experiment I carried out.  I wanted to evaluate
how much overhead is incurred by using several levels of directories to
implement a discrimating index.  I used the key format you specified:

	SHA1,SIZE

As data, I used my /usr/src/linux which uses 301M and contains 20753 files and
1389 directories.  To compute the key for a directory, I considered that its
contents were a mapping from names to keys.

When constructing the indexed archive, I actually stored empty files instead of
blobs because I am only interested in overhead.

Using your suggested indexing method that uses [0:4] as the 1st level key and
[4:8] as the 2nd level key, I obtain an indexed archive that occupies 159M,
where the top level contains 18665 1st level keys, the largest first level dir
contains 5 entries, and all 2nd level dirs contain exactly 1 entry.

Using Linus suggested 1 level [0:2] indexing, I obtain an indexed archive that
occupies 1.8M, where the top level contains 256 1st level keys, and where the
largest 1st level dir contains 110 entries.

This experiment was performed on an ext3 file system.

Cheers,

--Denys

^ permalink raw reply

* Re: [darcs-devel] Darcs and git: plan of action
From: Ralph Corderoy @ 2005-04-20 17:11 UTC (permalink / raw)
  To: Ray Lee; +Cc: Kevin Smith, git, darcs-devel
In-Reply-To: <1113951972.29444.42.camel@orca.madrabbit.org>


Hi Ray,

> Give me a case where assuming it's a replace will do the wrong thing,
> for C code, where it's a variable or function name.

How about two patches.

    1.  s/foo/bar/ throughout file because foo() has been decided upon
    as the name of a new globally visible forthcoming function but was
    already in use as a static function.

    2.  Add definition of new foo().

Patch 1 mustn't be a `darcs replace' despite it changing every occurence
of the C token foo into bar.

Cheers,


Ralph.


^ permalink raw reply

* Re: [GIT PATCH] I2C and W1 bugfixes for 2.6.12-rc2
From: Linus Torvalds @ 2005-04-20 17:15 UTC (permalink / raw)
  To: Zlatko Calusic; +Cc: Git Mailing List
In-Reply-To: <dnbr898k0k.fsf@magla.zg.iskon.hr>



On Wed, 20 Apr 2005, Zlatko Calusic wrote:
> 
> I see this -avz incantation mentioned everytime when rsync is
> involved. But, is the -z part (compression) really necessary knowing
> that we're dealing with an already compressed tree? Doesn't it put
> additional strain on the rsync server without any benefit in this
> case?
> 
> Or I might be too ignorant and not understand some internals well, but
> then... I would like to know the reason. :)

I'm not a big rsync user, so I just copied the examples of others.

You're right, for git, you should not use compression for files (I don't 
know if rsync compresses the directory listings by default, I assume it 
does). 

		Linus

^ permalink raw reply

* [PATCH] Some documentation...
From: David Greaves @ 2005-04-20 17:08 UTC (permalink / raw)
  To: Linus Torvalds, git

[-- Attachment #1: Type: text/plain, Size: 587 bytes --]

Hi

I'm starting to write some docs...
Comments... even "yep, looks OK, carry on" :)

I plan on putting the 'git command' ones into the 'git help ...' 
structure once Petr accepts it.
I guess the low level ones go into a README.reference until they 
stabilise and become man pages...

In doing this I noticed a couple of points:
* update-cache won't accept ./file or fred/./file
* checkout-cache doesn't seem to preserve mode

Are these bugs or should they be documented?
I've taken the approach of documenting behaviour for now.

Signed-off-by: David Greaves <david@dgreaves.com>
---



[-- Attachment #2: README.reference.patch --]
[-- Type: text/x-patch, Size: 4597 bytes --]

Index: README.reference
===================================================================
--- /dev/null  (tree:cf6a46a2199777c3dac32fa4479b97c0752cdf07)
+++ 30de093673d44c7ea8c56a0194fb792e47225ac8/README.reference  (mode:100644 sha1:2ec6683b22e5672ea46d27770fcb1a4b4c37aa0e)
@@ -0,0 +1,158 @@
+Terminology: - see README for description
+Each line contains terms used interchangeably
+
+object database, .git directory
+directory cache, index
+id, sha1, sha1-id, sha1 hash
+type, tag, tagname
+blob, blob object
+tree, tree object
+commit, commit object
+parent
+root object
+changeset
+
+################################################################
+cat-file
+	cat-file <-t | tagname> <sha1>
+
+Provide contents or type of objects in the repository. The tagname is
+required if it is not being interrogated.
+
+
+<sha1>
+	The sha1 identifier of the object.
+	(This is the sha1 of the uncompressed content.)
+
+-t
+	show the object type identified by <sha1>
+	One of: blob/tree/commit
+
+<tagname>
+	One of: blob/tree/commit
+
+
+################################################################
+check-files
+	check-files <file>...
+
+Check that a list of files are up-to-date between the filesystem and
+the cache. Used to verify a patch target before doing a patch.
+
+Files that do not exist on the filesystem are considered up-to-date
+(whether or not they are in the cache).
+
+Emits an error message on failure.
+
+exits with a status code indicating success if all files are
+up-to-date.
+
+
+see also: update-cache
+
+
+################################################################
+checkout-cache
+	checkout-cache [-q] [-a] [-f] [--] <file>...
+
+Will copy all files listed from the cache to the working directory
+(not overwriting existing files). Note that the file contents are
+restored - NOT the file permissions.
+
+-q
+	be quiet if files exist or are not in the cache
+
+-f
+	forces overwrite of existing files
+
+-a
+	checks out all files in the cache before processing listed
+	files.
+
+Note that the order of the flags matters:
+
+	checkout-cache -a -f file.c
+
+will first check out all files listed in the cache (but not overwrite
+any old ones), and then force-checkout file.c a second time (ie that
+one _will_ overwrite any old contents with the same filename).
+
+Also, just doing "checkout-cache" does nothing. You probably meant
+"checkout-cache -a". And if you want to force it, you want
+"checkout-cache -f -a".
+
+Intuitiveness is not the goal here. Repeatability is. The reason for
+the "no arguments means no work" thing is that from scripts you are
+supposed to be able to do things like
+
+	find . -name '*.h' -print0 | xargs -0 checkout-cache -f --
+
+which will force all existing *.h files to be replaced with their
+cached copies. If an empty command line implied "all", then this would
+force-refresh everything in the cache, which was not the point.
+
+Oh, and the "--" is just a good idea when you know the rest will be
+filenames. Just so that you wouldn't have a filename of "-a" causing
+problems (not possible in the above example, but get used to it in
+scripting!).
+
+
+################################################################
+commit-id
+	commit-id [tag]
+
+Returns the sha1-id of the commit object associated with given tag.
+
+tag
+	tag of commit object - defaults to the current HEAD.
+
+
+################################################################
+commit-tree
+	commit-tree <sha1> [-p <sha1>]* < changelog
+
+
+################################################################
+diff-tree
+	diff-tree [-r] [-z] <tree sha1> <tree sha1>
+
+
+################################################################
+ls-tree
+	ls-tree [-r] [-z] <key>
+
+
+################################################################
+merge-base
+	merge-base <commit-id> <commit-id>
+
+
+################################################################
+merge-cache
+	merge-cache <merge-program> (-a | <filename>*)
+
+
+################################################################
+read-tree
+	read-tree [-m] <sha1>
+
+
+################################################################
+rev-tree
+	rev-tree [--edges] [--cache <cache-file>] <commit-id> [<commit-id>]
+
+
+################################################################
+show-diff
+	show-diff [-q] [-s] [-z] [paths...]
+
+
+################################################################
+show-files
+	show-files [-z] [-t] (--[cached|deleted|others|ignored|stage])*
+
+
+################################################################
+unpack-file
+	unpack-file.c <sha1>
+

^ permalink raw reply

* Re: [PATCH] write-tree performance problems
From: Linus Torvalds @ 2005-04-20 17:06 UTC (permalink / raw)
  To: Chris Mason; +Cc: git
In-Reply-To: <200504201237.38374.mason@suse.com>



On Wed, 20 Apr 2005, Chris Mason wrote:
> 
> At any rate, the time for a single write-tree is pretty consistent.  Before it 
> was around .5 seconds, and with this change it goes down to .128s.

Oh, wow.

I bet your SHA1 implementation is done with hand-optimized and scheduled
x86 MMX code or something, while my poor G5 is probably using some slow
generic routine. As a result, it only improved by 33% for me since the
compression was just part of the picture, but with your cheap SHA1 the
compression costs really dominated, and so it's almost four times faster
for you.

Anyway, that's good. It definitely means that I consider tree writing to 
be "fast enough". You can commit patches in a third of a second on your 
machine.

I'll consider the problem solved for now. Yeah, I realize that it still 
takes you half a minute to commit the 100 quilt patches, but I just can't 
bring myself to think it's a huge problem in the kind of usage patterns I 
think are realistic.

If somebody really wants to replace quilt with git, he'd need to spend
some effort on it. If you just want to work together reasonably well, I
think 3 patches per second is pretty much there.

			Linus

^ permalink raw reply

* Re: [GIT PATCH] I2C and W1 bugfixes for 2.6.12-rc2
From: Zlatko Calusic @ 2005-04-20 16:56 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List
In-Reply-To: <Pine.LNX.4.58.0504191525290.2274@ppc970.osdl.org>

Linus Torvalds <torvalds@osdl.org> writes:

> Real merges have no patches taking place _anywhere_. And they take about 
> half a second. Doing an "update" of your tree should _literally_ boil down 
> to
>
> 	#
> 	# "repo" needs to point to the repo we update from
> 	#
> 	rsync -avz --ignore-existing $repo/objects/. .git/objects/.

I see this -avz incantation mentioned everytime when rsync is
involved. But, is the -z part (compression) really necessary knowing
that we're dealing with an already compressed tree? Doesn't it put
additional strain on the rsync server without any benefit in this
case?

Or I might be too ignorant and not understand some internals well, but
then... I would like to know the reason. :)

Regards,
-- 
Zlatko

^ permalink raw reply

* Re: [PATCH] write-tree performance problems
From: Linus Torvalds @ 2005-04-20 16:41 UTC (permalink / raw)
  To: Chris Mason; +Cc: git
In-Reply-To: <Pine.LNX.4.58.0504200931020.6467@ppc970.osdl.org>



On Wed, 20 Apr 2005, Linus Torvalds wrote:
> 
> NO! Don't see if this works. For the "sha1 file already exists" file, it 
> forgot to return the SHA1 value in "returnsha1", and would thus corrupt 
> the trees it wrote.

Proper version with fixes checked in. For me, it brings down the time to
write a kernel tree from 0.34s to 0.24s, so a third of the time was just
compressing objects that we ended up already having.

Two thirds to go ;)

		Linus

^ permalink raw reply

* Re: [PATCH] write-tree performance problems
From: Chris Mason @ 2005-04-20 16:37 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git
In-Reply-To: <Pine.LNX.4.58.0504200833580.6467@ppc970.osdl.org>

[-- Attachment #1: Type: text/plain, Size: 1059 bytes --]

On Wednesday 20 April 2005 11:40, Linus Torvalds wrote:
> On Wed, 20 Apr 2005, Chris Mason wrote:
> > Thanks for looking at this.  Your new tree is faster, it gets the commit
> > 100 patches time down from 1m5s to 50s.
>
> It really _shouldn't_ be faster. It still does the compression, and throws
> the end result away.

Well, that's a little odd.  I had thought about making sure you did this 
change and forgotten.  1 minute benchmarks are a horrible idea since they run 
into noise with cache writebacks.  I should know better...

At any rate, the time for a single write-tree is pretty consistent.  Before it 
was around .5 seconds, and with this change it goes down to .128s.  My patch 
was .024.

The 100 patch time is down to 32s (3 run average).  This is close enough that 
I don't think my patch is worth it if no other part of git can benefit from 
having trees in the index.

>
> To actually go faster, it _should_ need this patch. Untested. See if it
> works..

Thanks. This one missed the filling in the returnsha1.  New patch attached.

-chris

[-- Attachment #2: file-check.diff --]
[-- Type: text/x-diff, Size: 1059 bytes --]

diff -u linus.back/sha1_file.c linus/sha1_file.c
--- linus.back/sha1_file.c	2005-04-20 12:31:00.240181016 -0400
+++ linus/sha1_file.c	2005-04-20 12:13:56.339837528 -0400
@@ -173,12 +173,27 @@
 	z_stream stream;
 	unsigned char sha1[20];
 	SHA_CTX c;
+	char *filename;
+	int fd;
 
 	/* Sha1.. */
 	SHA1_Init(&c);
 	SHA1_Update(&c, buf, len);
 	SHA1_Final(sha1, &c);
 
+	filename = sha1_file_name(sha1);
+	fd = open(filename, O_WRONLY | O_CREAT | O_EXCL, 0666);
+	if (fd < 0) {
+		if (errno != EEXIST)
+			return -1;
+
+		/*
+		 * We might do collision checking here, but we'd need to
+		 * uncompress the old file and check it. Later.
+		 */
+		goto out;
+	}
+
 	/* Set it up */
 	memset(&stream, 0, sizeof(stream));
 	deflateInit(&stream, Z_BEST_COMPRESSION);
@@ -195,8 +210,10 @@
 	deflateEnd(&stream);
 	size = stream.total_out;
 
-	if (write_sha1_buffer(sha1, compressed, size) < 0)
-		return -1;
+	if (write(fd, compressed, size) != size)
+		die("unable to write file");
+	close(fd);
+out:		
 	if (returnsha1)
 		memcpy(returnsha1, sha1, 20);
 	return 0;

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox