* fast-import and unique objects.

From: Jon Smirl @ 2006-08-06 12:32 UTC
To: git, Shawn Pearce

This model has a lot of object duplication. I generated 949,305
revisions, but only 754,165 are unique. I'll modify my code to build a
hash of the objects it has seen and then not send the duplicates to
fast-import. Those 195,140 duplicated objects may be what is tripping
index-pack up.

--
Jon Smirl
jonsmirl@gmail.com

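A minimal sketch of the duplicate check Jon describes: hash each revision
exactly the way git hashes a blob ("blob <len>\0" + data) and skip anything
already seen. This is illustrative only; Jon's frontend is Python, the helper
name is an assumption, and C is used here simply to match the importer that
appears later in the thread.

#include <openssl/sha.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct seen { struct seen *next; unsigned char sha1[20]; };
static struct seen *seen_table[1 << 16];

/* Returns 1 the first time a blob is seen, 0 for duplicates. */
static int first_time_seen(const void *dat, unsigned long len)
{
	unsigned char sha1[20];
	char hdr[64];
	int hdrlen = sprintf(hdr, "blob %lu", len) + 1;  /* include the NUL */
	SHA_CTX c;
	struct seen *e;
	unsigned int h;

	SHA1_Init(&c);
	SHA1_Update(&c, hdr, hdrlen);
	SHA1_Update(&c, dat, len);
	SHA1_Final(sha1, &c);

	h = (sha1[0] << 8) | sha1[1];
	for (e = seen_table[h]; e; e = e->next)
		if (!memcmp(e->sha1, sha1, 20))
			return 0;                        /* duplicate */

	e = malloc(sizeof(*e));
	memcpy(e->sha1, sha1, 20);
	e->next = seen_table[h];
	seen_table[h] = e;
	return 1;
}
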
* Re: fast-import and unique objects.

From: Jon Smirl @ 2006-08-06 15:53 UTC
To: git, Shawn Pearce

On 8/6/06, Jon Smirl <jonsmirl@gmail.com> wrote:
> This model has a lot of object duplication. I generated 949,305
> revisions, but only 754,165 are unique. I'll modify my code to build a
> hash of the objects it has seen and then not send the duplicates to
> fast-import. Those 195,140 duplicated objects may be what is tripping
> index-pack up.

New run is finished with duplicate removal.

Time to run is unchanged, still 2hrs. Run time is IO bound, not CPU.
Pack file is 845MB instead of 934MB.
git-index-pack works now; it takes 4 CPU minutes to run.
Index file is 18MB.

So it looks like the first stage code is working. Next I need to
modify cvs2svn to keep track of the sha-1 through its sorting process
instead of file:revision.

--
Jon Smirl
jonsmirl@gmail.com

* Re: fast-import and unique objects.

From: Shawn Pearce @ 2006-08-06 18:03 UTC
To: Jon Smirl; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 2133 bytes --]

Jon Smirl <jonsmirl@gmail.com> wrote:
> On 8/6/06, Jon Smirl <jonsmirl@gmail.com> wrote:
> > This model has a lot of object duplication. I generated 949,305
> > revisions, but only 754,165 are unique. I'll modify my code to build a
> > hash of the objects it has seen and then not send the duplicates to
> > fast-import. Those 195,140 duplicated objects may be what is tripping
> > index-pack up.
>
> New run is finished with duplicate removal.
>
> Time to run is unchanged, still 2hrs. Run time is IO bound, not CPU.
> Pack file is 845MB instead of 934MB.
> git-index-pack works now; it takes 4 CPU minutes to run.
> Index file is 18MB.

I'm attaching a new version of fast-import.c which generates the
index and does duplicate removal. However I think it might be
slightly faster for you to do the duplicate removal in Python, as
that saves the user-kernel-user copy of the file data. Even so, this
new version should save you those 4 CPU minutes, as the index will
be generated from the in-memory SHA1s rather than needing to
recompute them.

I've changed the calling convention:

 - It now takes the pack's base name as its first parameter. It
   appends ".pack" and ".idx" to form the actual file names it's
   writing to.

 - It expects an estimated object count as its second parameter.
   In your case this would be something around 760000. This tells
   it how large of an object table to allocate, with each entry
   being 24 bytes + 1 pointer (28 or 32 bytes). Overshooting
   this number will cause it to degrade by allocating one
   overflow entry at a time from malloc.

So the new version should take about 20 MB of memory and should
produce a valid pack and index in the same time as it takes to do
only the pack now. Plus it won't generate duplicates.

> So it looks like the first stage code is working. Next I need to
> modify cvs2svn to keep track of the sha-1 through its sorting process
> instead of file:revision.

When you get down to tree writing and commit writing we might want
to do something similar with the trees and commits. I can modify
fast-import to also store those off into a pack.

--
Shawn.

[-- Attachment #2: fast-import.c --]
[-- Type: text/x-csrc, Size: 8018 bytes --]

#include "builtin.h"
#include "cache.h"
#include "object.h"
#include "blob.h"
#include "delta.h"
#include "pack.h"
#include "csum-file.h"

static int max_depth = 10;
static unsigned long object_count;
static unsigned long duplicate_count;
static unsigned long packoff;
static unsigned long overflow_count;
static int packfd;
static int current_depth;
static void *lastdat;
static unsigned long lastdatlen;
static unsigned char lastsha1[20];
static unsigned char packsha1[20];

struct object_entry {
	struct object_entry *next;
	unsigned long offset;
	unsigned char sha1[20];
};

struct overflow_object_entry {
	struct overflow_object_entry *next;
	struct object_entry oe;
};

struct object_entry *pool_start;
struct object_entry *pool_next;
struct object_entry *pool_end;
struct overflow_object_entry *overflow;
struct object_entry *table[1 << 16];

static struct object_entry* new_object(unsigned char *sha1)
{
	if (pool_next != pool_end) {
		struct object_entry *e = pool_next++;
		memcpy(e->sha1, sha1, sizeof(e->sha1));
		return e;
	} else {
		struct overflow_object_entry *e;

		e = xmalloc(sizeof(struct overflow_object_entry));
		e->next = overflow;
		memcpy(e->oe.sha1, sha1, sizeof(e->oe.sha1));
		overflow = e;
		overflow_count++;
		return &e->oe;
	}
}

static struct object_entry* insert_object(unsigned char *sha1)
{
	unsigned int h = sha1[0] << 8 | sha1[1];
	struct object_entry *e = table[h];
	struct object_entry *p = 0;

	while (e) {
		if (!memcmp(sha1, e->sha1, sizeof(e->sha1)))
			return e;
		p = e;
		e = e->next;
	}

	e = new_object(sha1);
	e->next = 0;
	e->offset = 0;
	if (p)
		p->next = e;
	else
		table[h] = e;
	return e;
}

static ssize_t yread(int fd, void *buffer, size_t length)
{
	ssize_t ret = 0;
	while (ret < length) {
		ssize_t size = xread(fd, (char *) buffer + ret, length - ret);
		if (size < 0) {
			return size;
		}
		if (size == 0) {
			return ret;
		}
		ret += size;
	}
	return ret;
}

static ssize_t ywrite(int fd, void *buffer, size_t length)
{
	ssize_t ret = 0;
	while (ret < length) {
		ssize_t size = xwrite(fd, (char *) buffer + ret, length - ret);
		if (size < 0) {
			return size;
		}
		if (size == 0) {
			return ret;
		}
		ret += size;
	}
	return ret;
}

static unsigned long encode_header(enum object_type type, unsigned long size, unsigned char *hdr)
{
	int n = 1;
	unsigned char c;

	if (type < OBJ_COMMIT || type > OBJ_DELTA)
		die("bad type %d", type);

	c = (type << 4) | (size & 15);
	size >>= 4;
	while (size) {
		*hdr++ = c | 0x80;
		c = size & 0x7f;
		size >>= 7;
		n++;
	}
	*hdr = c;
	return n;
}

static void write_blob(void *dat, unsigned long datlen)
{
	z_stream s;
	void *out, *delta;
	unsigned char hdr[64];
	unsigned long hdrlen, deltalen;

	if (lastdat && current_depth < max_depth) {
		delta = diff_delta(lastdat, lastdatlen,
			dat, datlen,
			&deltalen, 0);
	} else
		delta = 0;

	memset(&s, 0, sizeof(s));
	deflateInit(&s, zlib_compression_level);

	if (delta) {
		current_depth++;
		s.next_in = delta;
		s.avail_in = deltalen;
		hdrlen = encode_header(OBJ_DELTA, deltalen, hdr);
		if (ywrite(packfd, hdr, hdrlen) != hdrlen)
			die("Can't write object header: %s", strerror(errno));
		if (ywrite(packfd, lastsha1, sizeof(lastsha1)) != sizeof(lastsha1))
			die("Can't write object base: %s", strerror(errno));
		packoff += hdrlen + sizeof(lastsha1);
	} else {
		current_depth = 0;
		s.next_in = dat;
		s.avail_in = datlen;
		hdrlen = encode_header(OBJ_BLOB, datlen, hdr);
		if (ywrite(packfd, hdr, hdrlen) != hdrlen)
			die("Can't write object header: %s", strerror(errno));
		packoff += hdrlen;
	}

	s.avail_out = deflateBound(&s, s.avail_in);
	s.next_out = out = xmalloc(s.avail_out);
	while (deflate(&s, Z_FINISH) == Z_OK)
		/* nothing */;
	deflateEnd(&s);

	if (ywrite(packfd, out, s.total_out) != s.total_out)
		die("Failed writing compressed data %s", strerror(errno));
	packoff += s.total_out;

	free(out);
	if (delta)
		free(delta);
}

static void init_pack_header()
{
	const char* magic = "PACK";
	unsigned long version = 2;
	unsigned long zero = 0;

	version = htonl(version);

	if (ywrite(packfd, (char*)magic, 4) != 4)
		die("Can't write pack magic: %s", strerror(errno));
	if (ywrite(packfd, &version, 4) != 4)
		die("Can't write pack version: %s", strerror(errno));
	if (ywrite(packfd, &zero, 4) != 4)
		die("Can't write 0 object count: %s", strerror(errno));
	packoff = 4 * 3;
}

static void fixup_header_footer()
{
	SHA_CTX c;
	char hdr[8];
	unsigned long cnt;
	char *buf;
	size_t n;

	if (lseek(packfd, 0, SEEK_SET) != 0)
		die("Failed seeking to start: %s", strerror(errno));

	SHA1_Init(&c);
	if (yread(packfd, hdr, 8) != 8)
		die("Failed reading header: %s", strerror(errno));
	SHA1_Update(&c, hdr, 8);

	cnt = htonl(object_count);
	SHA1_Update(&c, &cnt, 4);
	if (ywrite(packfd, &cnt, 4) != 4)
		die("Failed writing object count: %s", strerror(errno));

	buf = xmalloc(128 * 1024);
	for (;;) {
		n = xread(packfd, buf, 128 * 1024);
		if (n <= 0)
			break;
		SHA1_Update(&c, buf, n);
	}
	free(buf);

	SHA1_Final(packsha1, &c);
	if (ywrite(packfd, packsha1, sizeof(packsha1)) != sizeof(packsha1))
		die("Failed writing pack checksum: %s", strerror(errno));
}

static int oecmp (const void *_a, const void *_b)
{
	struct object_entry *a = *((struct object_entry**)_a);
	struct object_entry *b = *((struct object_entry**)_b);
	return memcmp(a->sha1, b->sha1, sizeof(a->sha1));
}

static void write_index(const char *idx_name)
{
	struct sha1file *f;
	struct object_entry **idx, **c, **last;
	struct object_entry *e;
	struct overflow_object_entry *o;
	unsigned int array[256];
	int i;

	/* Build the sorted table of object IDs. */
	idx = xmalloc(object_count * sizeof(struct object_entry*));
	c = idx;
	for (e = pool_start; e != pool_next; e++)
		*c++ = e;
	for (o = overflow; o; o = o->next)
		*c++ = &o->oe;
	last = idx + object_count;
	qsort(idx, object_count, sizeof(struct object_entry*), oecmp);

	/* Generate the fan-out array. */
	c = idx;
	for (i = 0; i < 256; i++) {
		struct object_entry **next = c;
		while (next < last) {
			if ((*next)->sha1[0] != i)
				break;
			next++;
		}
		array[i] = htonl(next - idx);
		c = next;
	}

	f = sha1create("%s", idx_name);
	sha1write(f, array, 256 * sizeof(int));
	for (c = idx; c != last; c++) {
		unsigned int offset = htonl((*c)->offset);
		sha1write(f, &offset, 4);
		sha1write(f, (*c)->sha1, sizeof((*c)->sha1));
	}
	sha1write(f, packsha1, sizeof(packsha1));
	sha1close(f, NULL, 1);
	free(idx);
}

int main(int argc, const char **argv)
{
	const char *base_name = argv[1];
	int est_obj_cnt = atoi(argv[2]);
	char *pack_name;
	char *idx_name;

	pack_name = xmalloc(strlen(base_name) + 6);
	sprintf(pack_name, "%s.pack", base_name);
	idx_name = xmalloc(strlen(base_name) + 5);
	sprintf(idx_name, "%s.idx", base_name);

	packfd = open(pack_name, O_RDWR|O_CREAT|O_TRUNC, 0666);
	if (packfd < 0)
		die("Can't create pack file %s: %s", pack_name, strerror(errno));

	pool_start = xmalloc(est_obj_cnt * sizeof(struct object_entry));
	pool_next = pool_start;
	pool_end = pool_start + est_obj_cnt;

	init_pack_header();
	for (;;) {
		unsigned long datlen;
		int hdrlen;
		void *dat;
		char hdr[128];
		unsigned char sha1[20];
		SHA_CTX c;
		struct object_entry *e;

		if (yread(0, &datlen, 4) != 4)
			break;

		dat = xmalloc(datlen);
		if (yread(0, dat, datlen) != datlen)
			break;

		hdrlen = sprintf(hdr, "blob %lu", datlen) + 1;
		SHA1_Init(&c);
		SHA1_Update(&c, hdr, hdrlen);
		SHA1_Update(&c, dat, datlen);
		SHA1_Final(sha1, &c);

		e = insert_object(sha1);
		if (!e->offset) {
			e->offset = packoff;
			write_blob(dat, datlen);
			object_count++;
			printf("%s\n", sha1_to_hex(sha1));
			fflush(stdout);

			if (lastdat)
				free(lastdat);
			lastdat = dat;
			lastdatlen = datlen;
			memcpy(lastsha1, sha1, sizeof(sha1));
		} else {
			duplicate_count++;
			free(dat);
		}
	}
	fixup_header_footer();
	close(packfd);
	write_index(idx_name);

	fprintf(stderr, "%lu objects, %lu duplicates, %lu pool overflow\n",
		object_count, duplicate_count, overflow_count);

	return 0;
}

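For anyone trying the attached importer, the stdin framing its main loop
expects (see the yread(0, &datlen, 4) calls above) is simply a native-endian
4-byte length followed by that many bytes of blob content, one record per
revision. Below is a minimal illustrative feeder in C; the popen command line
in the comment is an assumption based on the calling convention described in
the message, not a documented interface.

#include <stdio.h>

static void send_blob(FILE *out, const void *dat, unsigned int len)
{
	fwrite(&len, 4, 1, out);       /* 4-byte length, native byte order */
	fwrite(dat, 1, len, out);      /* raw blob content */
	fflush(out);
}

int main(void)
{
	/* e.g. popen("fast-import mozilla 760000", "w") -- hypothetical */
	FILE *out = stdout;
	const char rev[] = "first revision of some file\n";

	send_blob(out, rev, sizeof(rev) - 1);
	return 0;
}

The importer prints the hex SHA-1 of each stored blob on its stdout, which a
frontend could read back to build the cvs_file:rev to sha-1 table discussed
later in the thread.
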
* Re: fast-import and unique objects.

From: Jon Smirl @ 2006-08-07 4:48 UTC
To: Shawn Pearce; +Cc: git

On 8/6/06, Shawn Pearce <spearce@spearce.org> wrote:
> So the new version should take about 20 MB of memory and should
> produce a valid pack and index in the same time as it takes to do
> only the pack now. Plus it won't generate duplicates.

I did a run with this and it works great.

I'm staring at the cvs2svn code now trying to figure out how to modify
it without rewriting everything. I may just leave it all alone and
build a table with cvs_file:rev to sha-1 mappings. It would be much
more efficient to carry the sha-1 throughout the stages, but that may
require significant rework.

--
Jon Smirl
jonsmirl@gmail.com

* Re: fast-import and unique objects.

From: Shawn Pearce @ 2006-08-07 5:04 UTC
To: Jon Smirl; +Cc: git

Jon Smirl <jonsmirl@gmail.com> wrote:
> On 8/6/06, Shawn Pearce <spearce@spearce.org> wrote:
> > So the new version should take about 20 MB of memory and should
> > produce a valid pack and index in the same time as it takes to do
> > only the pack now. Plus it won't generate duplicates.
>
> I did a run with this and it works great.

Good. :-)

On my drive in to work this afternoon I realized that making you
specify the size of the object table is stupid; I could easily
allocate a thousand objects at a time rather than preallocating the
whole thing. Oh well. fast-import thus far hasn't been meant as
production code for inclusion in core GIT, but maybe it will get
cleaned up and submitted as such if your conversion efforts go well
and produce a better CVS importer.

> I'm staring at the cvs2svn code now trying to figure out how to modify
> it without rewriting everything. I may just leave it all alone and
> build a table with cvs_file:rev to sha-1 mappings. It would be much
> more efficient to carry the sha-1 throughout the stages, but that may
> require significant rework.

Does it matter? How long does the cvs2svn processing take,
excluding the GIT blob processing that's now known to take 2 hours?
What's your target for an acceptable conversion time on the system
you are working on?

Any thoughts yet on how you might want to feed trees and commits
to a fast pack writer? I was thinking about doing a stream into
fast-import such as:

<4 byte length of commit><commit><treeent>*<null>

where <commit> is the raw commit minus the first "tree nnn\n" line,
and <treeent> is:

<type><sp><sha1><sp><path><null>

where <type> is one of 'B' (normal blob), 'L' (symlink), 'X'
(executable blob), <sha1> is the 40 byte hex, <path> is the file from
the root of the repository ("src/module/foo.c"), and <sp> and <null>
are the obvious values. You would feed all tree entries and the pack
writer would split the stream up into the individual tree objects.

fast-import would generate the tree(s), delta'ing them against the
prior tree of the same path, prefix "tree nnn\n" to the commit
blob you supplied, generate the commit, and print out its ID.
By working from the first commit up to the most recent, each tree
delta would use the older tree as its base, which may not be
ideal if a large number of items get added to a tree, but should be
effective enough to generate a reasonably sized initial pack.

It would, however, mean you need to monitor the output pipe from
fast-import to get back the commit id so you can use it to prep
the next commit's parent(s), as you can't produce that in Python.

--
Shawn.

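A small illustrative helper for the <treeent> record layout proposed above.
The function name and FILE-based output are assumptions for the sketch; this
is not code from any released fast-import.

#include <stdio.h>
#include <string.h>

/* Write one <type><sp><sha1><sp><path><null> record. */
static void emit_treeent(FILE *out, char type, const char *hexsha1, const char *path)
{
	fputc(type, out);                 /* 'B', 'L' or 'X' */
	fputc(' ', out);
	fwrite(hexsha1, 1, 40, out);      /* 40-byte hex SHA1 */
	fputc(' ', out);
	fwrite(path, 1, strlen(path), out);
	fputc('\0', out);                 /* NUL terminates the entry */
}
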
* Re: fast-import and unique objects.

From: Jon Smirl @ 2006-08-07 14:37 UTC
To: Shawn Pearce; +Cc: git

On 8/7/06, Shawn Pearce <spearce@spearce.org> wrote:
> > I'm staring at the cvs2svn code now trying to figure out how to modify
> > it without rewriting everything. I may just leave it all alone and
> > build a table with cvs_file:rev to sha-1 mappings. It would be much
> > more efficient to carry the sha-1 throughout the stages, but that may
> > require significant rework.
>
> Does it matter? How long does the cvs2svn processing take,
> excluding the GIT blob processing that's now known to take 2 hours?
> What's your target for an acceptable conversion time on the system
> you are working on?

As is, it takes the code about a week to import MozCVS into
Subversion. But I've already addressed the core of why that was
taking so long. The original code forks off a copy of cvs for each
revision to extract the text. Doing that 1M times takes about two
days. The version with fast-import takes two hours.

At the end of the process cvs2svn forks off svn 250K times to import
the change sets. That takes about four days to finish. Doing a
fast-import backend should fix that.

> Any thoughts yet on how you might want to feed trees and commits
> to a fast pack writer? I was thinking about doing a stream into
> fast-import such as:

The data I have generates an output that indicates add/change/delete
for each file name. Add/change should have an associated sha-1 for
the new revision. cvs/svn have no concept of trees.

How about sending out a stream of add/change/delete operations
interspersed with commits? That would let fast-import track the tree
and only generate tree nodes when they change.

The protocol may need some thought. I need to be able to handle
branches and labels too.

> <4 byte length of commit><commit><treeent>*<null>
>
> where <commit> is the raw commit minus the first "tree nnn\n" line,
> and <treeent> is:
>
> <type><sp><sha1><sp><path><null>
>
> where <type> is one of 'B' (normal blob), 'L' (symlink), 'X'
> (executable blob), <sha1> is the 40 byte hex, <path> is the file from
> the root of the repository ("src/module/foo.c"), and <sp> and <null>
> are the obvious values. You would feed all tree entries and the pack
> writer would split the stream up into the individual tree objects.
>
> fast-import would generate the tree(s), delta'ing them against the
> prior tree of the same path, prefix "tree nnn\n" to the commit
> blob you supplied, generate the commit, and print out its ID.
> By working from the first commit up to the most recent, each tree
> delta would use the older tree as its base, which may not be
> ideal if a large number of items get added to a tree, but should be
> effective enough to generate a reasonably sized initial pack.
>
> It would, however, mean you need to monitor the output pipe from
> fast-import to get back the commit id so you can use it to prep
> the next commit's parent(s), as you can't produce that in Python.
>
> --
> Shawn.

--
Jon Smirl
jonsmirl@gmail.com

* Re: fast-import and unique objects.

From: Jakub Narebski @ 2006-08-07 14:48 UTC
To: git

Jon Smirl wrote:

> How about sending out a stream of add/change/delete operations
> interspersed with commits? That would let fast-import track the tree
> and only generate tree nodes when they change.

The problem with CVS, which cvsps (CVS PatchSet) tries to address, is
that changes are per file, and reconstructing which changes go
together to make _one_ commit...

--
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git

* Re: fast-import and unique objects.

From: Jon Smirl @ 2006-08-07 18:45 UTC
To: Jakub Narebski; +Cc: git

On 8/7/06, Jakub Narebski <jnareb@gmail.com> wrote:
> Jon Smirl wrote:
>
> > How about sending out a stream of add/change/delete operations
> > interspersed with commits? That would let fast-import track the tree
> > and only generate tree nodes when they change.
>
> The problem with CVS, which cvsps (CVS PatchSet) tries to address, is
> that changes are per file, and reconstructing which changes go
> together to make _one_ commit...

The cvs2svn code is also gathering CVS commits into change sets. The
advantage of the cvs2svn code is that it is written by some of the
original CVS developers. Those developers are familiar with the 1,000
different ways to break a CVS repository and cvs2svn tries to deal
with the breakage.

cvsps works for some repositories but it fails on the Mozilla one.
There is something about branches in the Mozilla repository that
confuses it. cvs2svn handles the Mozilla repository correctly as far
as I can tell.

--
Jon Smirl
jonsmirl@gmail.com

* Re: fast-import and unique objects.

From: Shawn Pearce @ 2006-08-08 3:12 UTC
To: Jon Smirl; +Cc: git

Jon Smirl <jonsmirl@gmail.com> wrote:
> the change sets. That takes about four days to finish. Doing a
> fast-import backend should fix that.

Shouldn't be a problem. :-)

> > Any thoughts yet on how you might want to feed trees and commits
> > to a fast pack writer? I was thinking about doing a stream into
> > fast-import such as:
>
> The data I have generates an output that indicates add/change/delete
> for each file name. Add/change should have an associated sha-1 for
> the new revision. cvs/svn have no concept of trees.
>
> How about sending out a stream of add/change/delete operations
> interspersed with commits? That would let fast-import track the tree
> and only generate tree nodes when they change.
>
> The protocol may need some thought. I need to be able to handle
> branches and labels too.

Knowing a little bit about SVN I would assume the current cvs2svn
code would issue commands like:

  svn copy A B;  # Make branch B starting from where A is now.
  svn switch B;  # Make branch B the current branch.
  svn add F;     # Add file F.
  svn rm Y;      # Delete file Y.
  svn commit;    # Commit current tree on current branch.
  svn copy A T;  # Create tag T from where A is now.

But I don't know how it would issue a merge commit. Or even if it
could find such a thing in the RCS files.

The above command set would be rather trivial to implement in a
fast-import backend. I'm thinking we extend the protocol so it
looks something like the following:

  stream ::= cmd*;

  cmd ::= new_blob
        | new_branch
        | set_branch
        | update
        | commit
        | tag
        ;

  new_blob    ::= 'blob' blob_data;
  new_branch  ::= 'newb' branch_name source_branch;
  set_branch  ::= 'setb' branch_name;
  add_file    ::= update_file;
  update_file ::= 'updf' file_update;
  delete_file ::= 'delf' file_delete;
  commit      ::= 'comt' author_committer_msg;
  tag         ::= 'tagg' branch_name tagger_msg;

  source_branch ::= branch_name
                  | zero32
                  ;
  branch_name ::= len32 branch_name_str;
  branch_name_str ::= # valid GIT branch name, should be relative
                      # to .git/ (so include a refs/heads prefix)
                      ;
  file_update ::= len32 mode sp hexsha1 sp path;
  file_delete ::= len32 path;

  blob_data ::= len32 binary_data;

  author_committer_msg ::= len32
      'author' sp name '<' email '>' ts tz lf
      'committer' sp name '<' email '>' ts tz lf
      lf
      binary_data;

  tagger_msg ::= len32
      'tagger' sp name '<' email '>' ts tz lf
      lf
      binary_data;

  len32  ::= # unsigned 32 bit value, native format;
  zero32 ::= # 32 bits, unset, aka '0';

  mode ::= 'P' # plain file
         | 'X' # executable file
         ;

  binary_data ::= # file content, not interpreted;
  sp       ::= # ASCII space character;
  lf       ::= # ASCII newline (LF) character;
  path     ::= # GIT style file path, e.g. "a/b/c";
  hexsha1  ::= # SHA1 in hexadecimal format;
  nullsha1 ::= # 40 '0's in sequence;
  name     ::= # valid GIT author/committer name;
  email    ::= # valid GIT author/committer email;
  ts       ::= # time since the epoch in seconds, ascii decimal;
  tz       ::= # GIT style timezone;

This is a change from the current protocol, as new blobs need to
get prefixed by the command 'blob'. Note that all commands are 4
bytes wide and that any variable data portion of a command uses a
"Pascal style" leading 32 bit length. Although ugly, it just makes
it a tiny bit easier to code the stream parser. :-)

The commits and tags are half generated in the frontend and half
in the backend. The backend is producing and tracking the tree and
the current SHA1 of each branch. Consequently it will generate the
'tree' and 'parent' lines of a commit, or the 'object', 'type' and
'tag' lines of a tag. This is limited to only a linear development
path, no merges...

The backend would need to run inside of an existing GIT repository,
as it would output all tags and branch heads when it terminates.
(Right now it runs from anywhere.)

I don't know how many branches you would be asking the backend to
hold at once, but I was thinking that the backend would just create
a tree structure in memory when it receives a new_branch command,
and consider one of those to be the current branch when a set_branch
command is issued. With all branches in memory at once switching
would be very cheap, but if the number of branches is high it could
eat memory like a pig...

Right now I'm thinking that a file entry in a tree would cost:

  29 bytes + length_of_name + malloc_overhead

while a single tree would cost that plus:

  36 bytes + 4*number_of_entries + 2*malloc_overhead + last_treesz

where last_treesz is the last content of that tree (uncompressed),
so we can deltify against it quickly.

So what's your worst case number of files and directories in a
single commit? And how many branches are we talking about carrying
around in the backend?

--
Shawn.

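Because every command in the proposed grammar is a fixed 4-byte code, the
backend's read loop could be a simple dispatcher like the sketch below. The
handler calls are placeholders and the structure is an assumption; this is
not code from the thread.

#include <stdio.h>
#include <string.h>

int main(void)
{
	char cmd[4];

	/* Read one 4-byte command code at a time and dispatch on it. */
	while (fread(cmd, 1, 4, stdin) == 4) {
		if (!memcmp(cmd, "blob", 4))
			/* handle_new_blob(); */ ;
		else if (!memcmp(cmd, "newb", 4))
			/* handle_new_branch(); */ ;
		else if (!memcmp(cmd, "setb", 4))
			/* handle_set_branch(); */ ;
		else if (!memcmp(cmd, "updf", 4))
			/* handle_update_file(); */ ;
		else if (!memcmp(cmd, "delf", 4))
			/* handle_delete_file(); */ ;
		else if (!memcmp(cmd, "comt", 4))
			/* handle_commit(); */ ;
		else if (!memcmp(cmd, "tagg", 4))
			/* handle_tag(); */ ;
		else {
			fprintf(stderr, "unknown command %.4s\n", cmd);
			return 1;
		}
	}
	return 0;
}
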
* Re: fast-import and unique objects.

From: Jon Smirl @ 2006-08-08 12:11 UTC
To: Shawn Pearce; +Cc: git

On 8/7/06, Shawn Pearce <spearce@spearce.org> wrote:
> Knowing a little bit about SVN I would assume the current cvs2svn
> code would issue commands like:

We're designing a dumpfile format for git like the one SVN has.

>   svn copy A B;  # Make branch B starting from where A is now.
>   svn switch B;  # Make branch B the current branch.
>   svn add F;     # Add file F.
>   svn rm Y;      # Delete file Y.
>   svn commit;    # Commit current tree on current branch.
>   svn copy A T;  # Create tag T from where A is now.
>
> But I don't know how it would issue a merge commit. Or even if it
> could find such a thing in the RCS files.

AFAIK the svn code doesn't do merge commits. We probably need a post
processing pass in the git repo that finds the merges and closes off
the branches. gitk won't be pretty with 1,500 open branches. This may
need some manual clues.

Another thing missing in cvs2svn is a sane way to handle unnamed
branches. Right now it is generating names. It's not very good at
detecting everything that belongs together in an unnamed branch, so
one unnamed branch can turn into dozens of branches. Once we get
everything into a tool where it is possible to visualize the repo,
errors like this will be more obvious. That should make it easier to
come up with a strategy to fix the import process.

> The above command set would be rather trivial to implement in a
> fast-import backend. I'm thinking we extend the protocol so it
> looks something like the following:
>
>   stream ::= cmd*;
>
>   cmd ::= new_blob
>         | new_branch
>         | set_branch
>         | update
>         | commit
>         | tag
>         ;
>
>   new_blob    ::= 'blob' blob_data;
>   new_branch  ::= 'newb' branch_name source_branch;
>   set_branch  ::= 'setb' branch_name;
>   add_file    ::= update_file;
>   update_file ::= 'updf' file_update;
>   delete_file ::= 'delf' file_delete;
>   commit      ::= 'comt' author_committer_msg;
>   tag         ::= 'tagg' branch_name tagger_msg;
>
>   source_branch ::= branch_name
>                   | zero32
>                   ;
>   branch_name ::= len32 branch_name_str;
>   branch_name_str ::= # valid GIT branch name, should be relative
>                       # to .git/ (so include a refs/heads prefix)
>                       ;
>   file_update ::= len32 mode sp hexsha1 sp path;
>   file_delete ::= len32 path;
>
>   blob_data ::= len32 binary_data;
>
>   author_committer_msg ::= len32
>       'author' sp name '<' email '>' ts tz lf
>       'committer' sp name '<' email '>' ts tz lf
>       lf
>       binary_data;
>
>   tagger_msg ::= len32
>       'tagger' sp name '<' email '>' ts tz lf
>       lf
>       binary_data;
>
>   len32  ::= # unsigned 32 bit value, native format;
>   zero32 ::= # 32 bits, unset, aka '0';
>
>   mode ::= 'P' # plain file
>          | 'X' # executable file
>          ;
>
>   binary_data ::= # file content, not interpreted;
>   sp       ::= # ASCII space character;
>   lf       ::= # ASCII newline (LF) character;
>   path     ::= # GIT style file path, e.g. "a/b/c";
>   hexsha1  ::= # SHA1 in hexadecimal format;
>   nullsha1 ::= # 40 '0's in sequence;
>   name     ::= # valid GIT author/committer name;
>   email    ::= # valid GIT author/committer email;
>   ts       ::= # time since the epoch in seconds, ascii decimal;
>   tz       ::= # GIT style timezone;
>
> This is a change from the current protocol, as new blobs need to
> get prefixed by the command 'blob'. Note that all commands are 4
> bytes wide and that any variable data portion of a command uses a
> "Pascal style" leading 32 bit length. Although ugly, it just makes
> it a tiny bit easier to code the stream parser. :-)
>
> The commits and tags are half generated in the frontend and half
> in the backend. The backend is producing and tracking the tree and
> the current SHA1 of each branch. Consequently it will generate the
> 'tree' and 'parent' lines of a commit, or the 'object', 'type' and
> 'tag' lines of a tag. This is limited to only a linear development
> path, no merges...
>
> The backend would need to run inside of an existing GIT repository,
> as it would output all tags and branch heads when it terminates.
> (Right now it runs from anywhere.)
>
> I don't know how many branches you would be asking the backend to
> hold at once, but I was thinking that the backend would just create
> a tree structure in memory when it receives a new_branch command,
> and consider one of those to be the current branch when a set_branch
> command is issued. With all branches in memory at once switching
> would be very cheap, but if the number of branches is high it could
> eat memory like a pig...
>
> Right now I'm thinking that a file entry in a tree would cost:
>
>   29 bytes + length_of_name + malloc_overhead
>
> while a single tree would cost that plus:
>
>   36 bytes + 4*number_of_entries + 2*malloc_overhead + last_treesz
>
> where last_treesz is the last content of that tree (uncompressed),
> so we can deltify against it quickly.

The file names are used over and over. Alloc a giant chunk of memory
and keep appending the file name strings to it. Then build a little
tree so that you can look up existing names, i.e. turn the file names
into atoms. Never delete anything.

You can do something similar with the 29 and 36 byte arrays. Alloc
two giant chunks of memory. Append new entries to the end. Maintain a
list of deleted entries for reuse (store the pointer to the next free
entry inside the unused blocks, don't use a separate structure). You
have three things to track: end of used memory, head of free list,
and end of allocated memory. Now there is no malloc overhead on
millions of items.

> So what's your worst case number of files and directories in a
> single commit? And how many branches are we talking about carrying
> around in the backend?

About 100,000 files in the initial change set that builds the repo.
Final repo has 120,000 files.

There are 1,500 branches. I haven't looked at the svn dump file
format for branches, but I suspect that it sends everything on a
branch out at once and doesn't intersperse it with the trunk commits.

--
Jon Smirl
jonsmirl@gmail.com

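A rough sketch of the allocation scheme Jon describes: names appended to big
chunks and never freed ("atoms"), and fixed-size entries carved from large
blocks with an intrusive free list for reuse. The names, chunk sizes, and the
36-byte entry are illustrative assumptions; error handling and the name-lookup
tree he mentions are omitted.

#include <stdlib.h>
#include <string.h>

#define POOL_CHUNK (1024 * 1024)

static char *name_pool;
static size_t name_used, name_alloc;

/* Append a file name to the name pool and return its permanent address.
 * Assumes names are much shorter than one chunk; old chunks stay put. */
static const char *atom_name(const char *name)
{
	size_t len = strlen(name) + 1;
	char *p;

	if (name_used + len > name_alloc) {
		name_pool = malloc(POOL_CHUNK);
		name_alloc = POOL_CHUNK;
		name_used = 0;
	}
	p = name_pool + name_used;
	memcpy(p, name, len);
	name_used += len;
	return p;
}

/* Fixed-size entries: reuse deleted slots via an intrusive free list. */
struct entry { char data[36]; };
union slot { struct entry e; union slot *next_free; };

static union slot *free_head;
static union slot *block_next, *block_end;

static struct entry *alloc_entry(void)
{
	union slot *s;

	if (free_head) {                          /* reuse a deleted entry */
		s = free_head;
		free_head = s->next_free;
	} else {
		if (block_next == block_end) {    /* grab another big block */
			block_next = malloc(5000 * sizeof(union slot));
			block_end = block_next + 5000;
		}
		s = block_next++;
	}
	return &s->e;
}

static void free_entry(struct entry *e)
{
	union slot *s = (union slot *)e;
	s->next_free = free_head;                 /* push onto the free list */
	free_head = s;
}
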
* Re: fast-import and unique objects.

From: Shawn Pearce @ 2006-08-08 22:45 UTC
To: Jon Smirl; +Cc: git

Jon Smirl <jonsmirl@gmail.com> wrote:
> We're designing a dumpfile format for git like the one SVN has.

I'm not sure I'd call it a dumpfile format. More like an importfile
format. Reading a GIT pack is really pretty trivial; if someone was
going to write a parser/reader to pull apart a GIT repository and
use that information in another way they would just do it against
the pack files. It's really not that much code.

But generating a pack efficiently for a large volume of data is
slightly less trivial; the attempt here is to produce some tool that
can take a relatively trivial data stream and produce a reasonable
(but not necessarily smallest possible) pack from it in the least
amount of CPU and disk time necessary to do the job. I would hope
that nobody would seriously consider dumping a GIT repository back
INTO this format!

[snip]
> AFAIK the svn code doesn't do merge commits. We probably need a post
> processing pass in the git repo that finds the merges and closes off
> the branches. gitk won't be pretty with 1,500 open branches. This may
> need some manual clues.

*wince* 1500 open branches. Youch. OK, that answers a lot of
questions for me with regard to memory handling in fast-import,
which you provide excellent suggestions for below. I guess I didn't
think you had nearly that many...

[snip]
> The file names are used over and over. Alloc a giant chunk of memory
> and keep appending the file name strings to it. Then build a little
> tree so that you can look up existing names, i.e. turn the file names
> into atoms. Never delete anything.

Agreed. For 1500 branches it's worth doing.

[snip]
> About 100,000 files in the initial change set that builds the repo.
> Final repo has 120,000 files.
>
> There are 1,500 branches. I haven't looked at the svn dump file
> format for branches, but I suspect that it sends everything on a
> branch out at once and doesn't intersperse it with the trunk commits.

If you can tell fast-import you are completely done processing a
branch I can recycle the memory I have tied up for that branch; but
if that's going to be difficult then... hmm.

Right now I'm looking at around 5 MB/branch, based on implementing
the memory handling optimizations you suggested. That's still *huge*
for 1500 branches. I clearly can't hang onto every branch in memory
for the entire life of the import like I was planning on doing.
I'll kick that around for a couple of hours and see what I come
up with.

--
Shawn.

* Re: fast-import and unique objects.

From: Jon Smirl @ 2006-08-08 23:56 UTC
To: Shawn Pearce; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 2940 bytes --]

On 8/8/06, Shawn Pearce <spearce@spearce.org> wrote:
> Jon Smirl <jonsmirl@gmail.com> wrote:
> > We're designing a dumpfile format for git like the one SVN has.
>
> I'm not sure I'd call it a dumpfile format. More like an importfile
> format. Reading a GIT pack is really pretty trivial; if someone was
> going to write a parser/reader to pull apart a GIT repository and
> use that information in another way they would just do it against
> the pack files. It's really not that much code.
>
> But generating a pack efficiently for a large volume of data is
> slightly less trivial; the attempt here is to produce some tool that
> can take a relatively trivial data stream and produce a reasonable
> (but not necessarily smallest possible) pack from it in the least
> amount of CPU and disk time necessary to do the job. I would hope
> that nobody would seriously consider dumping a GIT repository back
> INTO this format!
>
> [snip]
> > AFAIK the svn code doesn't do merge commits. We probably need a post
> > processing pass in the git repo that finds the merges and closes off
> > the branches. gitk won't be pretty with 1,500 open branches. This may
> > need some manual clues.
>
> *wince* 1500 open branches. Youch. OK, that answers a lot of
> questions for me with regard to memory handling in fast-import,
> which you provide excellent suggestions for below. I guess I didn't
> think you had nearly that many...
>
> [snip]
> > The file names are used over and over. Alloc a giant chunk of memory
> > and keep appending the file name strings to it. Then build a little
> > tree so that you can look up existing names, i.e. turn the file names
> > into atoms. Never delete anything.
>
> Agreed. For 1500 branches it's worth doing.
>
> [snip]
> > About 100,000 files in the initial change set that builds the repo.
> > Final repo has 120,000 files.
> >
> > There are 1,500 branches. I haven't looked at the svn dump file
> > format for branches, but I suspect that it sends everything on a
> > branch out at once and doesn't intersperse it with the trunk commits.
>
> If you can tell fast-import you are completely done processing a
> branch I can recycle the memory I have tied up for that branch; but
> if that's going to be difficult then... hmm.
>
> Right now I'm looking at around 5 MB/branch, based on implementing
> the memory handling optimizations you suggested. That's still *huge*
> for 1500 branches. I clearly can't hang onto every branch in memory
> for the entire life of the import like I was planning on doing.
> I'll kick that around for a couple of hours and see what I come
> up with.

Some of these branches are what cvs2svn calls unlabeled branches.
cvs2svn is probably creating more of these than necessary since the
code for coalescing them into a single big unlabeled branch is not
that good. I attached the list of branch names being generated.

> --
> Shawn.

--
Jon Smirl
jonsmirl@gmail.com

[-- Attachment #2: cvs2svn-branches.txt.bz2 --]
[-- Type: application/x-bzip2, Size: 24830 bytes --]

* Re: fast-import and unique objects.

From: Martin Langhoff @ 2006-08-07 5:10 UTC
To: Jon Smirl; +Cc: Shawn Pearce, git

On 8/7/06, Jon Smirl <jonsmirl@gmail.com> wrote:
> On 8/6/06, Shawn Pearce <spearce@spearce.org> wrote:
> > So the new version should take about 20 MB of memory and should
> > produce a valid pack and index in the same time as it takes to do
> > only the pack now. Plus it won't generate duplicates.
>
> I did a run with this and it works great.

Great.

> I'm staring at the cvs2svn code now trying to figure out how to modify
> it without rewriting everything. I may just leave it all alone and
> build a table with cvs_file:rev to sha-1 mappings. It would be much
> more efficient to carry the sha-1 throughout the stages, but that may
> require significant rework.

Probably a good thing. If the patch isn't intrusive it has a better
chance of merging well over time and of being merged upstream -- ;-)

m

* Re: fast-import and unique objects.

From: Ryan Anderson @ 2006-08-07 7:57 UTC
To: Shawn Pearce; +Cc: Jon Smirl, git

On Sun, Aug 06, 2006 at 02:03:24PM -0400, Shawn Pearce wrote:
>
> - It expects an estimated object count as its second parameter.
>   In your case this would be something around 760000. This tells
>   it how large of an object table to allocate, with each entry
>   being 24 bytes + 1 pointer (28 or 32 bytes). Overshooting
>   this number will cause it to degrade by allocating one
>   overflow entry at a time from malloc.

Hrm, you're allocating a big table and then assigning consecutive
entries out of it, as pointers.

Why not just malloc a big block, and assign offsets into it, as if it
were a really big array? Every time it runs out, realloc it to double
the current size, and update the base pointer.

--
Ryan Anderson
  sometimes Pug Majere

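A sketch of the growth strategy Ryan suggests, written under the assumption
that callers hold array indexes rather than pointers so the move realloc makes
is harmless. The names and initial size are illustrative, not from the thread.

#include <stdlib.h>

struct object_entry { unsigned long offset; unsigned char sha1[20]; };

static struct object_entry *pool;
static size_t pool_used, pool_alloc;

/* Return the index of a freshly allocated entry, doubling the array
 * with realloc whenever it fills up. */
static size_t alloc_entry_index(void)
{
	if (pool_used == pool_alloc) {
		pool_alloc = pool_alloc ? 2 * pool_alloc : 1024;
		pool = realloc(pool, pool_alloc * sizeof(*pool));
		if (!pool)
			abort();   /* sketch: no graceful error handling */
	}
	return pool_used++;
}
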
* Re: fast-import and unique objects.

From: Shawn Pearce @ 2006-08-07 23:02 UTC
To: Ryan Anderson; +Cc: Jon Smirl, git

Ryan Anderson <ryan@michonline.com> wrote:
> On Sun, Aug 06, 2006 at 02:03:24PM -0400, Shawn Pearce wrote:
> >
> > - It expects an estimated object count as its second parameter.
> >   In your case this would be something around 760000. This tells
> >   it how large of an object table to allocate, with each entry
> >   being 24 bytes + 1 pointer (28 or 32 bytes). Overshooting
> >   this number will cause it to degrade by allocating one
> >   overflow entry at a time from malloc.
>
> Hrm, you're allocating a big table and then assigning consecutive
> entries out of it, as pointers.
>
> Why not just malloc a big block, and assign offsets into it, as if it
> were a really big array? Every time it runs out, realloc it to double
> the current size, and update the base pointer.

Because I didn't want to move a 24 MB block of memory. :-)

I'm probably going to clean that section of code up tonight and
allocate a large block at the beginning, then allocate overflow
blocks of about 5000 entries at a time. There's no need for the
blocks to be contiguous in memory; I just didn't want to have a high
overhead from malloc when there would be a large number of them...

--
Shawn.

Thread overview: 15+ messages

2006-08-06 12:32 fast-import and unique objects  Jon Smirl
2006-08-06 15:53 ` Jon Smirl
2006-08-06 18:03   ` Shawn Pearce
2006-08-07  4:48     ` Jon Smirl
2006-08-07  5:04       ` Shawn Pearce
2006-08-07 14:37         ` Jon Smirl
2006-08-07 14:48           ` Jakub Narebski
2006-08-07 18:45             ` Jon Smirl
2006-08-08  3:12           ` Shawn Pearce
2006-08-08 12:11             ` Jon Smirl
2006-08-08 22:45               ` Shawn Pearce
2006-08-08 23:56                 ` Jon Smirl
2006-08-07  5:10       ` Martin Langhoff
2006-08-07  7:57     ` Ryan Anderson
2006-08-07 23:02       ` Shawn Pearce