* fast-import and unique objects.

From: Jon Smirl @ 2006-08-06 12:32 UTC
To: git, Shawn Pearce

This model has a lot of object duplication. I generated 949,305
revisions, but only 754,165 are unique. I'll modify my code to build a
hash of the objects it has seen and then not send the duplicates to
fast-import. Those 195,140 duplicated objects may be what is tripping
index-pack up.

--
Jon Smirl
jonsmirl@gmail.com

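A minimal sketch of the duplicate check Jon describes: hash each revision
exactly the way git hashes a blob ("blob <len>\0" + data) and skip anything
already seen. This is illustrative only; Jon's frontend is Python, the helper
name is an assumption, and C is used here simply to match the importer that
appears later in the thread.

#include <openssl/sha.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct seen { struct seen *next; unsigned char sha1[20]; };
static struct seen *seen_table[1 << 16];

/* Returns 1 the first time a blob is seen, 0 for duplicates. */
static int first_time_seen(const void *dat, unsigned long len)
{
	unsigned char sha1[20];
	char hdr[64];
	int hdrlen = sprintf(hdr, "blob %lu", len) + 1;  /* include the NUL */
	SHA_CTX c;
	struct seen *e;
	unsigned int h;

	SHA1_Init(&c);
	SHA1_Update(&c, hdr, hdrlen);
	SHA1_Update(&c, dat, len);
	SHA1_Final(sha1, &c);

	h = (sha1[0] << 8) | sha1[1];
	for (e = seen_table[h]; e; e = e->next)
		if (!memcmp(e->sha1, sha1, 20))
			return 0;                        /* duplicate */

	e = malloc(sizeof(*e));
	memcpy(e->sha1, sha1, 20);
	e->next = seen_table[h];
	seen_table[h] = e;
	return 1;
}
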
* Re: fast-import and unique objects.

From: Jon Smirl @ 2006-08-06 15:53 UTC
To: git, Shawn Pearce

On 8/6/06, Jon Smirl <jonsmirl@gmail.com> wrote:
> This model has a lot of object duplication. I generated 949,305
> revisions, but only 754,165 are unique. I'll modify my code to build a
> hash of the objects it has seen and then not send the duplicates to
> fast-import. Those 195,140 duplicated objects may be what is tripping
> index-pack up.

New run is finished with duplicate removal.

Time to run is unchanged, still 2hrs. Run time is IO bound, not CPU.
Pack file is 845MB instead of 934MB.
git-index-pack works now; it takes 4 CPU minutes to run.
Index file is 18MB.

So it looks like the first stage code is working. Next I need to
modify cvs2svn to keep track of the sha-1 through its sorting process
instead of file:revision.

--
Jon Smirl
jonsmirl@gmail.com

* Re: fast-import and unique objects.

From: Shawn Pearce @ 2006-08-06 18:03 UTC
To: Jon Smirl; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 2133 bytes --]

Jon Smirl <jonsmirl@gmail.com> wrote:
> On 8/6/06, Jon Smirl <jonsmirl@gmail.com> wrote:
> > This model has a lot of object duplication. I generated 949,305
> > revisions, but only 754,165 are unique. I'll modify my code to build a
> > hash of the objects it has seen and then not send the duplicates to
> > fast-import. Those 195,140 duplicated objects may be what is tripping
> > index-pack up.
>
> New run is finished with duplicate removal.
>
> Time to run is unchanged, still 2hrs. Run time is IO bound, not CPU.
> Pack file is 845MB instead of 934MB.
> git-index-pack works now; it takes 4 CPU minutes to run.
> Index file is 18MB.

I'm attaching a new version of fast-import.c which generates the
index and does duplicate removal. However I think it might be
slightly faster for you to do the duplicate removal in Python, as
that saves the user-kernel-user copy of the file data. Even so, this
new version should save you those 4 CPU minutes, as the index will
be generated from the in-memory SHA1s rather than needing to
recompute them.

I've changed the calling convention:

 - It now takes the pack's base name as its first parameter. It
   appends ".pack" and ".idx" to form the actual file names it's
   writing to.

 - It expects an estimated object count as its second parameter.
   In your case this would be something around 760000. This tells
   it how large of an object table to allocate, with each entry
   being 24 bytes + 1 pointer (28 or 32 bytes). Overshooting
   this number will cause it to degrade by allocating one
   overflow entry at a time from malloc.

So the new version should take about 20 MB of memory and should
produce a valid pack and index in the same time as it takes to do
only the pack now. Plus it won't generate duplicates.

> So it looks like the first stage code is working. Next I need to
> modify cvs2svn to keep track of the sha-1 through its sorting process
> instead of file:revision.

When you get down to tree writing and commit writing we might want
to do something similar with the trees and commits. I can modify
fast-import to also store those off into a pack.

--
Shawn.

[-- Attachment #2: fast-import.c --]
[-- Type: text/x-csrc, Size: 8018 bytes --]

#include "builtin.h"
#include "cache.h"
#include "object.h"
#include "blob.h"
#include "delta.h"
#include "pack.h"
#include "csum-file.h"

static int max_depth = 10;
static unsigned long object_count;
static unsigned long duplicate_count;
static unsigned long packoff;
static unsigned long overflow_count;
static int packfd;
static int current_depth;
static void *lastdat;
static unsigned long lastdatlen;
static unsigned char lastsha1[20];
static unsigned char packsha1[20];

struct object_entry {
	struct object_entry *next;
	unsigned long offset;
	unsigned char sha1[20];
};

struct overflow_object_entry {
	struct overflow_object_entry *next;
	struct object_entry oe;
};

struct object_entry *pool_start;
struct object_entry *pool_next;
struct object_entry *pool_end;
struct overflow_object_entry *overflow;
struct object_entry *table[1 << 16];

static struct object_entry* new_object(unsigned char *sha1)
{
	if (pool_next != pool_end) {
		struct object_entry *e = pool_next++;
		memcpy(e->sha1, sha1, sizeof(e->sha1));
		return e;
	} else {
		struct overflow_object_entry *e;

		e = xmalloc(sizeof(struct overflow_object_entry));
		e->next = overflow;
		memcpy(e->oe.sha1, sha1, sizeof(e->oe.sha1));
		overflow = e;
		overflow_count++;
		return &e->oe;
	}
}

static struct object_entry* insert_object(unsigned char *sha1)
{
	unsigned int h = sha1[0] << 8 | sha1[1];
	struct object_entry *e = table[h];
	struct object_entry *p = 0;

	while (e) {
		if (!memcmp(sha1, e->sha1, sizeof(e->sha1)))
			return e;
		p = e;
		e = e->next;
	}

	e = new_object(sha1);
	e->next = 0;
	e->offset = 0;
	if (p)
		p->next = e;
	else
		table[h] = e;
	return e;
}

static ssize_t yread(int fd, void *buffer, size_t length)
{
	ssize_t ret = 0;
	while (ret < length) {
		ssize_t size = xread(fd, (char *) buffer + ret, length - ret);
		if (size < 0) {
			return size;
		}
		if (size == 0) {
			return ret;
		}
		ret += size;
	}
	return ret;
}

static ssize_t ywrite(int fd, void *buffer, size_t length)
{
	ssize_t ret = 0;
	while (ret < length) {
		ssize_t size = xwrite(fd, (char *) buffer + ret, length - ret);
		if (size < 0) {
			return size;
		}
		if (size == 0) {
			return ret;
		}
		ret += size;
	}
	return ret;
}

static unsigned long encode_header(enum object_type type, unsigned long size, unsigned char *hdr)
{
	int n = 1;
	unsigned char c;

	if (type < OBJ_COMMIT || type > OBJ_DELTA)
		die("bad type %d", type);

	c = (type << 4) | (size & 15);
	size >>= 4;
	while (size) {
		*hdr++ = c | 0x80;
		c = size & 0x7f;
		size >>= 7;
		n++;
	}
	*hdr = c;
	return n;
}

static void write_blob(void *dat, unsigned long datlen)
{
	z_stream s;
	void *out, *delta;
	unsigned char hdr[64];
	unsigned long hdrlen, deltalen;

	if (lastdat && current_depth < max_depth) {
		delta = diff_delta(lastdat, lastdatlen,
			dat, datlen,
			&deltalen, 0);
	} else
		delta = 0;

	memset(&s, 0, sizeof(s));
	deflateInit(&s, zlib_compression_level);

	if (delta) {
		current_depth++;
		s.next_in = delta;
		s.avail_in = deltalen;
		hdrlen = encode_header(OBJ_DELTA, deltalen, hdr);
		if (ywrite(packfd, hdr, hdrlen) != hdrlen)
			die("Can't write object header: %s", strerror(errno));
		if (ywrite(packfd, lastsha1, sizeof(lastsha1)) != sizeof(lastsha1))
			die("Can't write object base: %s", strerror(errno));
		packoff += hdrlen + sizeof(lastsha1);
	} else {
		current_depth = 0;
		s.next_in = dat;
		s.avail_in = datlen;
		hdrlen = encode_header(OBJ_BLOB, datlen, hdr);
		if (ywrite(packfd, hdr, hdrlen) != hdrlen)
			die("Can't write object header: %s", strerror(errno));
		packoff += hdrlen;
	}

	s.avail_out = deflateBound(&s, s.avail_in);
	s.next_out = out = xmalloc(s.avail_out);
	while (deflate(&s, Z_FINISH) == Z_OK)
		/* nothing */;
	deflateEnd(&s);

	if (ywrite(packfd, out, s.total_out) != s.total_out)
		die("Failed writing compressed data %s", strerror(errno));
	packoff += s.total_out;

	free(out);
	if (delta)
		free(delta);
}

static void init_pack_header()
{
	const char* magic = "PACK";
	unsigned long version = 2;
	unsigned long zero = 0;

	version = htonl(version);

	if (ywrite(packfd, (char*)magic, 4) != 4)
		die("Can't write pack magic: %s", strerror(errno));
	if (ywrite(packfd, &version, 4) != 4)
		die("Can't write pack version: %s", strerror(errno));
	if (ywrite(packfd, &zero, 4) != 4)
		die("Can't write 0 object count: %s", strerror(errno));
	packoff = 4 * 3;
}

static void fixup_header_footer()
{
	SHA_CTX c;
	char hdr[8];
	unsigned long cnt;
	char *buf;
	size_t n;

	if (lseek(packfd, 0, SEEK_SET) != 0)
		die("Failed seeking to start: %s", strerror(errno));

	SHA1_Init(&c);
	if (yread(packfd, hdr, 8) != 8)
		die("Failed reading header: %s", strerror(errno));
	SHA1_Update(&c, hdr, 8);

	cnt = htonl(object_count);
	SHA1_Update(&c, &cnt, 4);
	if (ywrite(packfd, &cnt, 4) != 4)
		die("Failed writing object count: %s", strerror(errno));

	buf = xmalloc(128 * 1024);
	for (;;) {
		n = xread(packfd, buf, 128 * 1024);
		if (n <= 0)
			break;
		SHA1_Update(&c, buf, n);
	}
	free(buf);

	SHA1_Final(packsha1, &c);
	if (ywrite(packfd, packsha1, sizeof(packsha1)) != sizeof(packsha1))
		die("Failed writing pack checksum: %s", strerror(errno));
}

static int oecmp (const void *_a, const void *_b)
{
	struct object_entry *a = *((struct object_entry**)_a);
	struct object_entry *b = *((struct object_entry**)_b);
	return memcmp(a->sha1, b->sha1, sizeof(a->sha1));
}

static void write_index(const char *idx_name)
{
	struct sha1file *f;
	struct object_entry **idx, **c, **last;
	struct object_entry *e;
	struct overflow_object_entry *o;
	unsigned int array[256];
	int i;

	/* Build the sorted table of object IDs. */
	idx = xmalloc(object_count * sizeof(struct object_entry*));
	c = idx;
	for (e = pool_start; e != pool_next; e++)
		*c++ = e;
	for (o = overflow; o; o = o->next)
		*c++ = &o->oe;
	last = idx + object_count;
	qsort(idx, object_count, sizeof(struct object_entry*), oecmp);

	/* Generate the fan-out array. */
	c = idx;
	for (i = 0; i < 256; i++) {
		struct object_entry **next = c;
		while (next < last) {
			if ((*next)->sha1[0] != i)
				break;
			next++;
		}
		array[i] = htonl(next - idx);
		c = next;
	}

	f = sha1create("%s", idx_name);
	sha1write(f, array, 256 * sizeof(int));
	for (c = idx; c != last; c++) {
		unsigned int offset = htonl((*c)->offset);
		sha1write(f, &offset, 4);
		sha1write(f, (*c)->sha1, sizeof((*c)->sha1));
	}
	sha1write(f, packsha1, sizeof(packsha1));
	sha1close(f, NULL, 1);
	free(idx);
}

int main(int argc, const char **argv)
{
	const char *base_name = argv[1];
	int est_obj_cnt = atoi(argv[2]);
	char *pack_name;
	char *idx_name;

	pack_name = xmalloc(strlen(base_name) + 6);
	sprintf(pack_name, "%s.pack", base_name);
	idx_name = xmalloc(strlen(base_name) + 5);
	sprintf(idx_name, "%s.idx", base_name);

	packfd = open(pack_name, O_RDWR|O_CREAT|O_TRUNC, 0666);
	if (packfd < 0)
		die("Can't create pack file %s: %s", pack_name, strerror(errno));

	pool_start = xmalloc(est_obj_cnt * sizeof(struct object_entry));
	pool_next = pool_start;
	pool_end = pool_start + est_obj_cnt;

	init_pack_header();
	for (;;) {
		unsigned long datlen;
		int hdrlen;
		void *dat;
		char hdr[128];
		unsigned char sha1[20];
		SHA_CTX c;
		struct object_entry *e;

		if (yread(0, &datlen, 4) != 4)
			break;

		dat = xmalloc(datlen);
		if (yread(0, dat, datlen) != datlen)
			break;

		hdrlen = sprintf(hdr, "blob %lu", datlen) + 1;
		SHA1_Init(&c);
		SHA1_Update(&c, hdr, hdrlen);
		SHA1_Update(&c, dat, datlen);
		SHA1_Final(sha1, &c);

		e = insert_object(sha1);
		if (!e->offset) {
			e->offset = packoff;
			write_blob(dat, datlen);
			object_count++;
			printf("%s\n", sha1_to_hex(sha1));
			fflush(stdout);

			if (lastdat)
				free(lastdat);
			lastdat = dat;
			lastdatlen = datlen;
			memcpy(lastsha1, sha1, sizeof(sha1));
		} else {
			duplicate_count++;
			free(dat);
		}
	}
	fixup_header_footer();
	close(packfd);
	write_index(idx_name);

	fprintf(stderr, "%lu objects, %lu duplicates, %lu pool overflow\n",
		object_count, duplicate_count, overflow_count);

	return 0;
}

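For anyone trying the attached importer, the stdin framing its main loop
expects (see the yread(0, &datlen, 4) calls above) is simply a native-endian
4-byte length followed by that many bytes of blob content, one record per
revision. Below is a minimal illustrative feeder in C; the popen command line
in the comment is an assumption based on the calling convention described in
the message, not a documented interface.

#include <stdio.h>

static void send_blob(FILE *out, const void *dat, unsigned int len)
{
	fwrite(&len, 4, 1, out);       /* 4-byte length, native byte order */
	fwrite(dat, 1, len, out);      /* raw blob content */
	fflush(out);
}

int main(void)
{
	/* e.g. popen("fast-import mozilla 760000", "w") -- hypothetical */
	FILE *out = stdout;
	const char rev[] = "first revision of some file\n";

	send_blob(out, rev, sizeof(rev) - 1);
	return 0;
}

The importer prints the hex SHA-1 of each stored blob on its stdout, which a
frontend could read back to build the cvs_file:rev to sha-1 table discussed
later in the thread.
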
* Re: fast-import and unique objects.

From: Jon Smirl @ 2006-08-07 4:48 UTC
To: Shawn Pearce; +Cc: git

On 8/6/06, Shawn Pearce <spearce@spearce.org> wrote:
> So the new version should take about 20 MB of memory and should
> produce a valid pack and index in the same time as it takes to do
> only the pack now. Plus it won't generate duplicates.

I did a run with this and it works great.

I'm staring at the cvs2svn code now trying to figure out how to modify
it without rewriting everything. I may just leave it all alone and
build a table with cvs_file:rev to sha-1 mappings. It would be much
more efficient to carry the sha-1 throughout the stages, but that may
require significant rework.

--
Jon Smirl
jonsmirl@gmail.com

* Re: fast-import and unique objects.

From: Shawn Pearce @ 2006-08-07 5:04 UTC
To: Jon Smirl; +Cc: git

Jon Smirl <jonsmirl@gmail.com> wrote:
> On 8/6/06, Shawn Pearce <spearce@spearce.org> wrote:
> > So the new version should take about 20 MB of memory and should
> > produce a valid pack and index in the same time as it takes to do
> > only the pack now. Plus it won't generate duplicates.
>
> I did a run with this and it works great.

Good. :-)

On my drive in to work this afternoon I realized that making you
specify the size of the object table is stupid; I could easily
allocate a thousand objects at a time rather than preallocating the
whole thing. Oh well. fast-import thus far hasn't been meant as
production code for inclusion in core GIT, but maybe it will get
cleaned up and submitted as such if your conversion efforts go well
and produce a better CVS importer.

> I'm staring at the cvs2svn code now trying to figure out how to modify
> it without rewriting everything. I may just leave it all alone and
> build a table with cvs_file:rev to sha-1 mappings. It would be much
> more efficient to carry the sha-1 throughout the stages, but that may
> require significant rework.

Does it matter? How long does the cvs2svn processing take,
excluding the GIT blob processing that's now known to take 2 hours?
What's your target for an acceptable conversion time on the system
you are working on?

Any thoughts yet on how you might want to feed trees and commits
to a fast pack writer? I was thinking about doing a stream into
fast-import such as:

<4 byte length of commit><commit><treeent>*<null>

where <commit> is the raw commit minus the first "tree nnn\n" line,
and <treeent> is:

<type><sp><sha1><sp><path><null>

where <type> is one of 'B' (normal blob), 'L' (symlink), 'X'
(executable blob), <sha1> is the 40 byte hex, <path> is the file from
the root of the repository ("src/module/foo.c"), and <sp> and <null>
are the obvious values. You would feed all tree entries and the pack
writer would split the stream up into the individual tree objects.

fast-import would generate the tree(s), delta'ing them against the
prior tree of the same path, prefix "tree nnn\n" to the commit
blob you supplied, generate the commit, and print out its ID.
By working from the first commit up to the most recent, each tree
delta would use the older tree as its base, which may not be
ideal if a large number of items get added to a tree, but should be
effective enough to generate a reasonably sized initial pack.

It would, however, mean you need to monitor the output pipe from
fast-import to get back the commit id so you can use it to prep
the next commit's parent(s), as you can't produce that in Python.

--
Shawn.

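A small illustrative helper for the <treeent> record layout proposed above.
The function name and FILE-based output are assumptions for the sketch; this
is not code from any released fast-import.

#include <stdio.h>
#include <string.h>

/* Write one <type><sp><sha1><sp><path><null> record. */
static void emit_treeent(FILE *out, char type, const char *hexsha1, const char *path)
{
	fputc(type, out);                 /* 'B', 'L' or 'X' */
	fputc(' ', out);
	fwrite(hexsha1, 1, 40, out);      /* 40-byte hex SHA1 */
	fputc(' ', out);
	fwrite(path, 1, strlen(path), out);
	fputc('\0', out);                 /* NUL terminates the entry */
}
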
* Re: fast-import and unique objects.

From: Jon Smirl @ 2006-08-07 14:37 UTC
To: Shawn Pearce; +Cc: git

On 8/7/06, Shawn Pearce <spearce@spearce.org> wrote:
> > I'm staring at the cvs2svn code now trying to figure out how to modify
> > it without rewriting everything. I may just leave it all alone and
> > build a table with cvs_file:rev to sha-1 mappings. It would be much
> > more efficient to carry the sha-1 throughout the stages, but that may
> > require significant rework.
>
> Does it matter? How long does the cvs2svn processing take,
> excluding the GIT blob processing that's now known to take 2 hours?
> What's your target for an acceptable conversion time on the system
> you are working on?

As is, it takes the code about a week to import MozCVS into
Subversion. But I've already addressed the core of why that was
taking so long. The original code forks off a copy of cvs for each
revision to extract the text. Doing that 1M times takes about two
days. The version with fast-import takes two hours.

At the end of the process cvs2svn forks off svn 250K times to import
the change sets. That takes about four days to finish. Doing a
fast-import backend should fix that.

> Any thoughts yet on how you might want to feed trees and commits
> to a fast pack writer? I was thinking about doing a stream into
> fast-import such as:

The data I have generates an output that indicates add/change/delete
for each file name. Add/change should have an associated sha-1 for
the new revision. cvs/svn have no concept of trees.

How about sending out a stream of add/change/delete operations
interspersed with commits? That would let fast-import track the tree
and only generate tree nodes when they change.

The protocol may need some thought. I need to be able to handle
branches and labels too.

> <4 byte length of commit><commit><treeent>*<null>
>
> where <commit> is the raw commit minus the first "tree nnn\n" line,
> and <treeent> is:
>
> <type><sp><sha1><sp><path><null>
>
> where <type> is one of 'B' (normal blob), 'L' (symlink), 'X'
> (executable blob), <sha1> is the 40 byte hex, <path> is the file from
> the root of the repository ("src/module/foo.c"), and <sp> and <null>
> are the obvious values. You would feed all tree entries and the pack
> writer would split the stream up into the individual tree objects.
>
> fast-import would generate the tree(s), delta'ing them against the
> prior tree of the same path, prefix "tree nnn\n" to the commit
> blob you supplied, generate the commit, and print out its ID.
> By working from the first commit up to the most recent, each tree
> delta would use the older tree as its base, which may not be
> ideal if a large number of items get added to a tree, but should be
> effective enough to generate a reasonably sized initial pack.
>
> It would, however, mean you need to monitor the output pipe from
> fast-import to get back the commit id so you can use it to prep
> the next commit's parent(s), as you can't produce that in Python.
>
> --
> Shawn.

--
Jon Smirl
jonsmirl@gmail.com

* Re: fast-import and unique objects.

From: Jakub Narebski @ 2006-08-07 14:48 UTC
To: git

Jon Smirl wrote:

> How about sending out a stream of add/change/delete operations
> interspersed with commits? That would let fast-import track the tree
> and only generate tree nodes when they change.

The problem with CVS, which cvsps (CVS PatchSet) tries to address, is
that changes are per file, and reconstructing which changes go
together to make _one_ commit...

--
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git

* Re: fast-import and unique objects.

From: Jon Smirl @ 2006-08-07 18:45 UTC
To: Jakub Narebski; +Cc: git

On 8/7/06, Jakub Narebski <jnareb@gmail.com> wrote:
> Jon Smirl wrote:
>
> > How about sending out a stream of add/change/delete operations
> > interspersed with commits? That would let fast-import track the tree
> > and only generate tree nodes when they change.
>
> The problem with CVS, which cvsps (CVS PatchSet) tries to address, is
> that changes are per file, and reconstructing which changes go
> together to make _one_ commit...

The cvs2svn code is also gathering CVS commits into change sets. The
advantage of the cvs2svn code is that it is written by some of the
original CVS developers. Those developers are familiar with the 1,000
different ways to break a CVS repository and cvs2svn tries to deal
with the breakage.

cvsps works for some repositories but it fails on the Mozilla one.
There is something about branches in the Mozilla repository that
confuses it. cvs2svn handles the Mozilla repository correctly as far
as I can tell.

--
Jon Smirl
jonsmirl@gmail.com

* Re: fast-import and unique objects.

From: Shawn Pearce @ 2006-08-08 3:12 UTC
To: Jon Smirl; +Cc: git

Jon Smirl <jonsmirl@gmail.com> wrote:
> the change sets. That takes about four days to finish. Doing a
> fast-import backend should fix that.

Shouldn't be a problem. :-)

> > Any thoughts yet on how you might want to feed trees and commits
> > to a fast pack writer? I was thinking about doing a stream into
> > fast-import such as:
>
> The data I have generates an output that indicates add/change/delete
> for each file name. Add/change should have an associated sha-1 for
> the new revision. cvs/svn have no concept of trees.
>
> How about sending out a stream of add/change/delete operations
> interspersed with commits? That would let fast-import track the tree
> and only generate tree nodes when they change.
>
> The protocol may need some thought. I need to be able to handle
> branches and labels too.

Knowing a little bit about SVN I would assume the current cvs2svn
code would issue commands like:

  svn copy A B;  # Make branch B starting from where A is now.
  svn switch B;  # Make branch B the current branch.
  svn add F;     # Add file F.
  svn rm Y;      # Delete file Y.
  svn commit;    # Commit current tree on current branch.
  svn copy A T;  # Create tag T from where A is now.

But I don't know how it would issue a merge commit. Or even if it
could find such a thing in the RCS files.

The above command set would be rather trivial to implement in a
fast-import backend. I'm thinking we extend the protocol so it
looks something like the following:

  stream ::= cmd*;

  cmd ::= new_blob
        | new_branch
        | set_branch
        | update
        | commit
        | tag
        ;

  new_blob    ::= 'blob' blob_data;
  new_branch  ::= 'newb' branch_name source_branch;
  set_branch  ::= 'setb' branch_name;
  add_file    ::= update_file;
  update_file ::= 'updf' file_update;
  delete_file ::= 'delf' file_delete;
  commit      ::= 'comt' author_committer_msg;
  tag         ::= 'tagg' branch_name tagger_msg;

  source_branch ::= branch_name
                  | zero32
                  ;
  branch_name ::= len32 branch_name_str;
  branch_name_str ::= # valid GIT branch name, should be relative
                      # to .git/ (so include a refs/heads prefix)
                      ;
  file_update ::= len32 mode sp hexsha1 sp path;
  file_delete ::= len32 path;

  blob_data ::= len32 binary_data;

  author_committer_msg ::= len32
      'author' sp name '<' email '>' ts tz lf
      'committer' sp name '<' email '>' ts tz lf
      lf
      binary_data;

  tagger_msg ::= len32
      'tagger' sp name '<' email '>' ts tz lf
      lf
      binary_data;

  len32  ::= # unsigned 32 bit value, native format;
  zero32 ::= # 32 bits, unset, aka '0';

  mode ::= 'P' # plain file
         | 'X' # executable file
         ;

  binary_data ::= # file content, not interpreted;
  sp       ::= # ASCII space character;
  lf       ::= # ASCII newline (LF) character;
  path     ::= # GIT style file path, e.g. "a/b/c";
  hexsha1  ::= # SHA1 in hexadecimal format;
  nullsha1 ::= # 40 '0's in sequence;
  name     ::= # valid GIT author/committer name;
  email    ::= # valid GIT author/committer email;
  ts       ::= # time since the epoch in seconds, ascii decimal;
  tz       ::= # GIT style timezone;

This is a change from the current protocol, as new blobs need to
get prefixed by the command 'blob'. Note that all commands are 4
bytes wide and that any variable data portion of a command uses a
"Pascal style" leading 32 bit length. Although ugly, it just makes
it a tiny bit easier to code the stream parser. :-)

The commits and tags are half generated in the frontend and half
in the backend. The backend is producing and tracking the tree and
the current SHA1 of each branch. Consequently it will generate the
'tree' and 'parent' lines of a commit, or the 'object', 'type' and
'tag' lines of a tag. This is limited to only a linear development
path, no merges...

The backend would need to run inside of an existing GIT repository,
as it would output all tags and branch heads when it terminates.
(Right now it runs from anywhere.)

I don't know how many branches you would be asking the backend to
hold at once, but I was thinking that the backend would just create
a tree structure in memory when it receives a new_branch command,
and consider one of those to be the current branch when a set_branch
command is issued. With all branches in memory at once switching
would be very cheap, but if the number of branches is high it could
eat memory like a pig...

Right now I'm thinking that a file entry in a tree would cost:

  29 bytes + length_of_name + malloc_overhead

while a single tree would cost that plus:

  36 bytes + 4*number_of_entries + 2*malloc_overhead + last_treesz

where last_treesz is the last content of that tree (uncompressed),
so we can deltify against it quickly.

So what's your worst case number of files and directories in a
single commit? And how many branches are we talking about carrying
around in the backend?

--
Shawn.

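Because every command in the proposed grammar is a fixed 4-byte code, the
backend's read loop could be a simple dispatcher like the sketch below. The
handler calls are placeholders and the structure is an assumption; this is
not code from the thread.

#include <stdio.h>
#include <string.h>

int main(void)
{
	char cmd[4];

	/* Read one 4-byte command code at a time and dispatch on it. */
	while (fread(cmd, 1, 4, stdin) == 4) {
		if (!memcmp(cmd, "blob", 4))
			/* handle_new_blob(); */ ;
		else if (!memcmp(cmd, "newb", 4))
			/* handle_new_branch(); */ ;
		else if (!memcmp(cmd, "setb", 4))
			/* handle_set_branch(); */ ;
		else if (!memcmp(cmd, "updf", 4))
			/* handle_update_file(); */ ;
		else if (!memcmp(cmd, "delf", 4))
			/* handle_delete_file(); */ ;
		else if (!memcmp(cmd, "comt", 4))
			/* handle_commit(); */ ;
		else if (!memcmp(cmd, "tagg", 4))
			/* handle_tag(); */ ;
		else {
			fprintf(stderr, "unknown command %.4s\n", cmd);
			return 1;
		}
	}
	return 0;
}
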
* Re: fast-import and unique objects.

From: Jon Smirl @ 2006-08-08 12:11 UTC
To: Shawn Pearce; +Cc: git

On 8/7/06, Shawn Pearce <spearce@spearce.org> wrote:
> Knowing a little bit about SVN I would assume the current cvs2svn
> code would issue commands like:

We're designing a dumpfile format for git like the one SVN has.

>   svn copy A B;  # Make branch B starting from where A is now.
>   svn switch B;  # Make branch B the current branch.
>   svn add F;     # Add file F.
>   svn rm Y;      # Delete file Y.
>   svn commit;    # Commit current tree on current branch.
>   svn copy A T;  # Create tag T from where A is now.
>
> But I don't know how it would issue a merge commit. Or even if it
> could find such a thing in the RCS files.

AFAIK the svn code doesn't do merge commits. We probably need a post
processing pass in the git repo that finds the merges and closes off
the branches. gitk won't be pretty with 1,500 open branches. This may
need some manual clues.

Another thing missing in cvs2svn is a sane way to handle unnamed
branches. Right now it is generating names. It's not very good at
detecting everything that belongs together in an unnamed branch, so
one unnamed branch can turn into dozens of branches. Once we get
everything into a tool where it is possible to visualize the repo,
errors like this will be more obvious. That should make it easier to
come up with a strategy to fix the import process.

> The above command set would be rather trivial to implement in a
> fast-import backend. I'm thinking we extend the protocol so it
> looks something like the following:
>
>   stream ::= cmd*;
>
>   cmd ::= new_blob
>         | new_branch
>         | set_branch
>         | update
>         | commit
>         | tag
>         ;
>
>   new_blob    ::= 'blob' blob_data;
>   new_branch  ::= 'newb' branch_name source_branch;
>   set_branch  ::= 'setb' branch_name;
>   add_file    ::= update_file;
>   update_file ::= 'updf' file_update;
>   delete_file ::= 'delf' file_delete;
>   commit      ::= 'comt' author_committer_msg;
>   tag         ::= 'tagg' branch_name tagger_msg;
>
>   source_branch ::= branch_name
>                   | zero32
>                   ;
>   branch_name ::= len32 branch_name_str;
>   branch_name_str ::= # valid GIT branch name, should be relative
>                       # to .git/ (so include a refs/heads prefix)
>                       ;
>   file_update ::= len32 mode sp hexsha1 sp path;
>   file_delete ::= len32 path;
>
>   blob_data ::= len32 binary_data;
>
>   author_committer_msg ::= len32
>       'author' sp name '<' email '>' ts tz lf
>       'committer' sp name '<' email '>' ts tz lf
>       lf
>       binary_data;
>
>   tagger_msg ::= len32
>       'tagger' sp name '<' email '>' ts tz lf
>       lf
>       binary_data;
>
>   len32  ::= # unsigned 32 bit value, native format;
>   zero32 ::= # 32 bits, unset, aka '0';
>
>   mode ::= 'P' # plain file
>          | 'X' # executable file
>          ;
>
>   binary_data ::= # file content, not interpreted;
>   sp       ::= # ASCII space character;
>   lf       ::= # ASCII newline (LF) character;
>   path     ::= # GIT style file path, e.g. "a/b/c";
>   hexsha1  ::= # SHA1 in hexadecimal format;
>   nullsha1 ::= # 40 '0's in sequence;
>   name     ::= # valid GIT author/committer name;
>   email    ::= # valid GIT author/committer email;
>   ts       ::= # time since the epoch in seconds, ascii decimal;
>   tz       ::= # GIT style timezone;
>
> This is a change from the current protocol, as new blobs need to
> get prefixed by the command 'blob'. Note that all commands are 4
> bytes wide and that any variable data portion of a command uses a
> "Pascal style" leading 32 bit length. Although ugly, it just makes
> it a tiny bit easier to code the stream parser. :-)
>
> The commits and tags are half generated in the frontend and half
> in the backend. The backend is producing and tracking the tree and
> the current SHA1 of each branch. Consequently it will generate the
> 'tree' and 'parent' lines of a commit, or the 'object', 'type' and
> 'tag' lines of a tag. This is limited to only a linear development
> path, no merges...
>
> The backend would need to run inside of an existing GIT repository,
> as it would output all tags and branch heads when it terminates.
> (Right now it runs from anywhere.)
>
> I don't know how many branches you would be asking the backend to
> hold at once, but I was thinking that the backend would just create
> a tree structure in memory when it receives a new_branch command,
> and consider one of those to be the current branch when a set_branch
> command is issued. With all branches in memory at once switching
> would be very cheap, but if the number of branches is high it could
> eat memory like a pig...
>
> Right now I'm thinking that a file entry in a tree would cost:
>
>   29 bytes + length_of_name + malloc_overhead
>
> while a single tree would cost that plus:
>
>   36 bytes + 4*number_of_entries + 2*malloc_overhead + last_treesz
>
> where last_treesz is the last content of that tree (uncompressed),
> so we can deltify against it quickly.

The file names are used over and over. Alloc a giant chunk of memory
and keep appending the file name strings to it. Then build a little
tree so that you can look up existing names, i.e. turn the file names
into atoms. Never delete anything.

You can do something similar with the 29 and 36 byte arrays. Alloc
two giant chunks of memory. Append new entries to the end. Maintain a
list of deleted entries for reuse (store the pointer to the next free
entry inside the unused blocks, don't use a separate structure). You
have three things to track: end of used memory, head of free list,
and end of allocated memory. Now there is no malloc overhead on
millions of items.

> So what's your worst case number of files and directories in a
> single commit? And how many branches are we talking about carrying
> around in the backend?

About 100,000 files in the initial change set that builds the repo.
Final repo has 120,000 files.

There are 1,500 branches. I haven't looked at the svn dump file
format for branches, but I suspect that it sends everything on a
branch out at once and doesn't intersperse it with the trunk commits.

--
Jon Smirl
jonsmirl@gmail.com

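A rough sketch of the allocation scheme Jon describes: names appended to big
chunks and never freed ("atoms"), and fixed-size entries carved from large
blocks with an intrusive free list for reuse. The names, chunk sizes, and the
36-byte entry are illustrative assumptions; error handling and the name-lookup
tree he mentions are omitted.

#include <stdlib.h>
#include <string.h>

#define POOL_CHUNK (1024 * 1024)

static char *name_pool;
static size_t name_used, name_alloc;

/* Append a file name to the name pool and return its permanent address.
 * Assumes names are much shorter than one chunk; old chunks stay put. */
static const char *atom_name(const char *name)
{
	size_t len = strlen(name) + 1;
	char *p;

	if (name_used + len > name_alloc) {
		name_pool = malloc(POOL_CHUNK);
		name_alloc = POOL_CHUNK;
		name_used = 0;
	}
	p = name_pool + name_used;
	memcpy(p, name, len);
	name_used += len;
	return p;
}

/* Fixed-size entries: reuse deleted slots via an intrusive free list. */
struct entry { char data[36]; };
union slot { struct entry e; union slot *next_free; };

static union slot *free_head;
static union slot *block_next, *block_end;

static struct entry *alloc_entry(void)
{
	union slot *s;

	if (free_head) {                          /* reuse a deleted entry */
		s = free_head;
		free_head = s->next_free;
	} else {
		if (block_next == block_end) {    /* grab another big block */
			block_next = malloc(5000 * sizeof(union slot));
			block_end = block_next + 5000;
		}
		s = block_next++;
	}
	return &s->e;
}

static void free_entry(struct entry *e)
{
	union slot *s = (union slot *)e;
	s->next_free = free_head;                 /* push onto the free list */
	free_head = s;
}
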
* Re: fast-import and unique objects.

From: Shawn Pearce @ 2006-08-08 22:45 UTC
To: Jon Smirl; +Cc: git

Jon Smirl <jonsmirl@gmail.com> wrote:
> We're designing a dumpfile format for git like the one SVN has.

I'm not sure I'd call it a dumpfile format. More like an importfile
format. Reading a GIT pack is really pretty trivial; if someone was
going to write a parser/reader to pull apart a GIT repository and
use that information in another way they would just do it against
the pack files. It's really not that much code.

But generating a pack efficiently for a large volume of data is
slightly less trivial; the attempt here is to produce some tool that
can take a relatively trivial data stream and produce a reasonable
(but not necessarily smallest possible) pack from it in the least
amount of CPU and disk time necessary to do the job. I would hope
that nobody would seriously consider dumping a GIT repository back
INTO this format!

[snip]
> AFAIK the svn code doesn't do merge commits. We probably need a post
> processing pass in the git repo that finds the merges and closes off
> the branches. gitk won't be pretty with 1,500 open branches. This may
> need some manual clues.

*wince* 1500 open branches. Youch. OK, that answers a lot of
questions for me with regard to memory handling in fast-import,
which you provide excellent suggestions for below. I guess I didn't
think you had nearly that many...

[snip]
> The file names are used over and over. Alloc a giant chunk of memory
> and keep appending the file name strings to it. Then build a little
> tree so that you can look up existing names, i.e. turn the file names
> into atoms. Never delete anything.

Agreed. For 1500 branches it's worth doing.

[snip]
> About 100,000 files in the initial change set that builds the repo.
> Final repo has 120,000 files.
>
> There are 1,500 branches. I haven't looked at the svn dump file
> format for branches, but I suspect that it sends everything on a
> branch out at once and doesn't intersperse it with the trunk commits.

If you can tell fast-import you are completely done processing a
branch I can recycle the memory I have tied up for that branch; but
if that's going to be difficult then... hmm.

Right now I'm looking at around 5 MB/branch, based on implementing
the memory handling optimizations you suggested. That's still *huge*
for 1500 branches. I clearly can't hang onto every branch in memory
for the entire life of the import like I was planning on doing.
I'll kick that around for a couple of hours and see what I come
up with.

--
Shawn.

* Re: fast-import and unique objects.

From: Jon Smirl @ 2006-08-08 23:56 UTC
To: Shawn Pearce; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 2940 bytes --]

On 8/8/06, Shawn Pearce <spearce@spearce.org> wrote:
> Jon Smirl <jonsmirl@gmail.com> wrote:
> > We're designing a dumpfile format for git like the one SVN has.
>
> I'm not sure I'd call it a dumpfile format. More like an importfile
> format. Reading a GIT pack is really pretty trivial; if someone was
> going to write a parser/reader to pull apart a GIT repository and
> use that information in another way they would just do it against
> the pack files. It's really not that much code.
>
> But generating a pack efficiently for a large volume of data is
> slightly less trivial; the attempt here is to produce some tool that
> can take a relatively trivial data stream and produce a reasonable
> (but not necessarily smallest possible) pack from it in the least
> amount of CPU and disk time necessary to do the job. I would hope
> that nobody would seriously consider dumping a GIT repository back
> INTO this format!
>
> [snip]
> > AFAIK the svn code doesn't do merge commits. We probably need a post
> > processing pass in the git repo that finds the merges and closes off
> > the branches. gitk won't be pretty with 1,500 open branches. This may
> > need some manual clues.
>
> *wince* 1500 open branches. Youch. OK, that answers a lot of
> questions for me with regard to memory handling in fast-import,
> which you provide excellent suggestions for below. I guess I didn't
> think you had nearly that many...
>
> [snip]
> > The file names are used over and over. Alloc a giant chunk of memory
> > and keep appending the file name strings to it. Then build a little
> > tree so that you can look up existing names, i.e. turn the file names
> > into atoms. Never delete anything.
>
> Agreed. For 1500 branches it's worth doing.
>
> [snip]
> > About 100,000 files in the initial change set that builds the repo.
> > Final repo has 120,000 files.
> >
> > There are 1,500 branches. I haven't looked at the svn dump file
> > format for branches, but I suspect that it sends everything on a
> > branch out at once and doesn't intersperse it with the trunk commits.
>
> If you can tell fast-import you are completely done processing a
> branch I can recycle the memory I have tied up for that branch; but
> if that's going to be difficult then... hmm.
>
> Right now I'm looking at around 5 MB/branch, based on implementing
> the memory handling optimizations you suggested. That's still *huge*
> for 1500 branches. I clearly can't hang onto every branch in memory
> for the entire life of the import like I was planning on doing.
> I'll kick that around for a couple of hours and see what I come
> up with.

Some of these branches are what cvs2svn calls unlabeled branches.
cvs2svn is probably creating more of these than necessary since the
code for coalescing them into a single big unlabeled branch is not
that good. I attached the list of branch names being generated.

> --
> Shawn.

--
Jon Smirl
jonsmirl@gmail.com

[-- Attachment #2: cvs2svn-branches.txt.bz2 --]
[-- Type: application/x-bzip2, Size: 24830 bytes --]

* Re: fast-import and unique objects.

From: Martin Langhoff @ 2006-08-07 5:10 UTC
To: Jon Smirl; +Cc: Shawn Pearce, git

On 8/7/06, Jon Smirl <jonsmirl@gmail.com> wrote:
> On 8/6/06, Shawn Pearce <spearce@spearce.org> wrote:
> > So the new version should take about 20 MB of memory and should
> > produce a valid pack and index in the same time as it takes to do
> > only the pack now. Plus it won't generate duplicates.
>
> I did a run with this and it works great.

Great.

> I'm staring at the cvs2svn code now trying to figure out how to modify
> it without rewriting everything. I may just leave it all alone and
> build a table with cvs_file:rev to sha-1 mappings. It would be much
> more efficient to carry the sha-1 throughout the stages, but that may
> require significant rework.

Probably a good thing. If the patch isn't intrusive it has a better
chance of merging well over time and of being merged upstream -- ;-)

m

* Re: fast-import and unique objects.

From: Ryan Anderson @ 2006-08-07 7:57 UTC
To: Shawn Pearce; +Cc: Jon Smirl, git

On Sun, Aug 06, 2006 at 02:03:24PM -0400, Shawn Pearce wrote:
>
> - It expects an estimated object count as its second parameter.
>   In your case this would be something around 760000. This tells
>   it how large of an object table to allocate, with each entry
>   being 24 bytes + 1 pointer (28 or 32 bytes). Overshooting
>   this number will cause it to degrade by allocating one
>   overflow entry at a time from malloc.

Hrm, you're allocating a big table and then assigning consecutive
entries out of it, as pointers.

Why not just malloc a big block, and assign offsets into it, as if it
were a really big array? Every time it runs out, realloc it to double
the current size, and update the base pointer.

--
Ryan Anderson
  sometimes Pug Majere

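A sketch of the growth strategy Ryan suggests, written under the assumption
that callers hold array indexes rather than pointers so the move realloc makes
is harmless. The names and initial size are illustrative, not from the thread.

#include <stdlib.h>

struct object_entry { unsigned long offset; unsigned char sha1[20]; };

static struct object_entry *pool;
static size_t pool_used, pool_alloc;

/* Return the index of a freshly allocated entry, doubling the array
 * with realloc whenever it fills up. */
static size_t alloc_entry_index(void)
{
	if (pool_used == pool_alloc) {
		pool_alloc = pool_alloc ? 2 * pool_alloc : 1024;
		pool = realloc(pool, pool_alloc * sizeof(*pool));
		if (!pool)
			abort();   /* sketch: no graceful error handling */
	}
	return pool_used++;
}
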
* Re: fast-import and unique objects.

From: Shawn Pearce @ 2006-08-07 23:02 UTC
To: Ryan Anderson; +Cc: Jon Smirl, git

Ryan Anderson <ryan@michonline.com> wrote:
> On Sun, Aug 06, 2006 at 02:03:24PM -0400, Shawn Pearce wrote:
> >
> > - It expects an estimated object count as its second parameter.
> >   In your case this would be something around 760000. This tells
> >   it how large of an object table to allocate, with each entry
> >   being 24 bytes + 1 pointer (28 or 32 bytes). Overshooting
> >   this number will cause it to degrade by allocating one
> >   overflow entry at a time from malloc.
>
> Hrm, you're allocating a big table and then assigning consecutive
> entries out of it, as pointers.
>
> Why not just malloc a big block, and assign offsets into it, as if it
> were a really big array? Every time it runs out, realloc it to double
> the current size, and update the base pointer.

Because I didn't want to move a 24 MB block of memory. :-)

I'm probably going to clean that section of code up tonight and
allocate a large block at the beginning, then allocate overflow
blocks of about 5000 entries at a time. There's no need for the
blocks to be contiguous in memory; I just didn't want to have a high
overhead from malloc when there would be a large number of them...

--
Shawn.

Thread overview: 15+ messages

2006-08-06 12:32 fast-import and unique objects  Jon Smirl
2006-08-06 15:53 ` Jon Smirl
2006-08-06 18:03   ` Shawn Pearce
2006-08-07  4:48     ` Jon Smirl
2006-08-07  5:04       ` Shawn Pearce
2006-08-07 14:37         ` Jon Smirl
2006-08-07 14:48           ` Jakub Narebski
2006-08-07 18:45             ` Jon Smirl
2006-08-08  3:12           ` Shawn Pearce
2006-08-08 12:11             ` Jon Smirl
2006-08-08 22:45               ` Shawn Pearce
2006-08-08 23:56                 ` Jon Smirl
2006-08-07  5:10       ` Martin Langhoff
2006-08-07  7:57     ` Ryan Anderson
2006-08-07 23:02       ` Shawn Pearce