* Creating objects manually and repack

From: Jon Smirl @ 2006-08-04 3:43 UTC (permalink / raw)
To: git

I've made 500K object files with my cvs2svn front end. This is 500K of
revision files and no tree files. Now I run git-repack. It says done
counting zero objects. What needs to be updated so that repack will find
all of my objects?

git-fsck isn't happy either since I have no HEAD.

-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Creating objects manually and repack

From: Jeff King @ 2006-08-04 3:58 UTC (permalink / raw)
To: Jon Smirl; +Cc: git

On Thu, Aug 03, 2006 at 11:43:42PM -0400, Jon Smirl wrote:

> I've made 500K object files with my cvs2svn front end. This is 500K of
> revision files and no tree files. Now I run git-repack. It says done
> counting zero objects. What needs to be updated so that repack will
> find all of my objects?

git-repack starts at your heads and works its way down. You can either:

 - make a dummy commit for a tree with all of your blobs
   (git-commit-tree reads the commit message from stdin):

	$ while read sha1; do
	    echo -e "100644 blob $sha1\t$sha1"
	  done <list_of_sha1s | git-update-index --index-info
	$ tree=$(git-write-tree)
	$ commit=$(echo 'import' | git-commit-tree $tree)
	$ git-update-ref HEAD $commit

 - call git-pack-objects directly with a list of objects:

	$ git-pack-objects .git/objects/pack/pack <list_of_sha1s

Obviously the latter is simpler, but the former will also make
git-fsck-objects happy. Note that they're both untested, so there might
be typos.

-Peff
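[Editorial aside, not part of the original thread: both recipes above feed
git 40-hex object names that the importer already wrote as loose objects.
Git names a blob by hashing the header "blob <size>\0" followed by the raw
contents; a minimal Python sketch of that naming rule, so an importer can
compute the names it will later list:]

```python
import hashlib

def git_blob_sha1(data: bytes) -> str:
    """Compute the object name git gives a blob: SHA-1 over the
    header "blob <size>\\0" followed by the raw contents."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()

# The empty blob has the same well-known name in every git repository.
print(git_blob_sha1(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

[One hex name per line on stdin is exactly the list_of_sha1s format the
commands above expect.]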
* Re: Creating objects manually and repack

From: Linus Torvalds @ 2006-08-04 4:01 UTC (permalink / raw)
To: Jon Smirl; +Cc: git

On Thu, 3 Aug 2006, Jon Smirl wrote:
>
> I've made 500K object files with my cvs2svn front end. This is 500K of
> revision files and no tree files. Now I run git-repack. It says done
> counting zero objects. What needs to be updated so that repack will
> find all of my objects?

Just enumerate them by hand, and pass the list off to git-pack-objects.

IOW, you can _literally_ do something like this

	(cd .git/objects ; find . -type f -name '[0-9a-f]*' | tr -d '\./') |
		git-pack-objects tmp-pack

and it will generate a pack-file and index (called "tmp-pack-*.pack" and
"tmp-pack-*.idx" respectively) that contain all your loose objects.

Now, that said, the pack-file will generally _suck_ if you actually do it
like the above. You actually want to pass in the object names _together_
with the filenames they were generated from, so that git-pack-objects can
use its heuristics for finding good delta candidates.

So what you actually want to do is pass in a set of object names, each
with the name of the file it came with (space in between). See for
example the output of

	git-rev-list --objects HEAD^..

for how something like that might look (git-pack-objects is designed to
take the "git-rev-list --objects" output as its input).

> git-fsck isn't happy either since I have no HEAD.

Yeah, you cannot (and mustn't) run anything like git-fsck-objects or
"git prune" until you've connected them all up somehow.

		Linus
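[Editorial aside, not part of the original thread: the find|tr pipeline
above works because loose objects live in a two-level fan-out layout —
the first two hex digits of the name are a subdirectory, the remaining 38
are the file name. A small Python sketch of the same enumeration, under
that layout assumption:]

```python
import os

def list_loose_objects(objects_dir):
    """Yield 40-hex object names from a .git/objects directory,
    where object abcd... is stored as objects/ab/cd... ; the
    pack/ and info/ subdirectories are skipped."""
    for fan in sorted(os.listdir(objects_dir)):
        if len(fan) != 2:  # skip pack/, info/
            continue
        subdir = os.path.join(objects_dir, fan)
        for rest in sorted(os.listdir(subdir)):
            yield fan + rest
```

[Printing these names, one per line, into git-pack-objects reproduces the
shell pipeline above.]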
* Re: Creating objects manually and repack

From: Jon Smirl @ 2006-08-04 4:24 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git

I am converting all of the revisions from each CVS file into git objects
the first time the file is parsed. The plan was to run repack after each
file is finished. That way it should be easy to figure out the deltas,
since everything will be a variation on the same file.

So what's the best way to pack these objects, append them to the existing
pack, and then clean everything up for the next file? I am parsing 120K
CVS files containing over 1M revs.

After I get all of the objects written and packed, later code is going to
write out the trees.

-- 
Jon Smirl
jonsmirl@gmail.com
* Re: Creating objects manually and repack

From: Linus Torvalds @ 2006-08-04 4:46 UTC (permalink / raw)
To: Jon Smirl; +Cc: git

On Fri, 4 Aug 2006, Jon Smirl wrote:
>
> I am converting all of the revisions from each CVS file into git
> objects the first time the file is parsed. The plan was to run repack
> after each file is finished. That way it should be easy to figure out
> the deltas since everything will be a variation on the same file.

Sure. In that case, just list the object ID's in the exact same order you
created them.

Basically, as you create them, just keep a list of all ID's you've
created, and every (say) 50,000 objects, just do a

	echo all objects you've created | git-pack-objects new-pack

and then move the new pack into place, and remove all the loose objects
(don't even bother using "git prune" - just basically do something like
"rm -rf .git/objects/??" to get rid of them).

> So what's the best way to pack these objects, append them to the
> existing pack and then clean everything up for the next file? I am
> parsing 120K CVS files containing over 1M revs.

You'll want to repack every once in a while just to not ever have _tons_
of those loose objects around, but if you do it every 50,000 objects,
you'll have just twenty nice pack-files once you're done, containing all
one million objects, and you'll never have had more than ~200 files in
any of the loose object subdirectories.

Of course, you might want to make that "every 50,000 objects" thing
tunable, so that if you don't have a lot of memory for caching, you might
want to do it a bit more often, just to make each repack go faster and
not have tons of IO.
You can then do a _full_ repack to get one big pack, by just listing
every object you ever created (in creation order) to git-pack-objects,
and then you can replace all the twenty (smaller) pack-files with the
resulting single bigger one.

In fact, at that point you no longer even need to worry about "creation
order", since you've basically created all the deltas in the first phase,
and regardless of ordering, when you then repack everything at the end,
it will re-use all the earlier delta information.

		Linus
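[Editorial aside, not part of the original thread: the "repack every
50,000 objects" loop above is just batching. A sketch of the control
flow, with a hypothetical pack_batch callback standing in for the
git-pack-objects invocation plus the loose-object cleanup:]

```python
def flush_in_batches(object_ids, batch_size, pack_batch):
    """Feed object names to pack_batch(list) every batch_size
    objects, preserving creation order within each batch — the
    order the deltifier wants to see them in."""
    batch = []
    for oid in object_ids:
        batch.append(oid)
        if len(batch) == batch_size:
            pack_batch(batch)
            batch = []
    if batch:  # final partial batch
        pack_batch(batch)
```

[With one million objects and batch_size=50000 this yields the twenty
pack-files Linus describes; making batch_size tunable covers the
low-memory case he mentions.]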
* Re: Creating objects manually and repack

From: Linus Torvalds @ 2006-08-04 5:01 UTC (permalink / raw)
To: Jon Smirl; +Cc: git

On Thu, 3 Aug 2006, Linus Torvalds wrote:
>
> Sure. In that case, just list the object ID's in the exact same order
> you created them.

Btw, you still want to give a filename for each object you've created, so
that the delta sorter does the right thing for the packing.

It doesn't have to be a _real_ filename - just make sure that each
revision that comes from the same file has a filename that matches all
the other revisions from that file. What the filename actually _is_
doesn't much matter, and it doesn't have to be the "real" filename that
was associated with that set of revisions, since we'll just end up
hashing it anyway. So it could be some "SVN inode number" for that set
of revisions or something, for all git-pack-objects cares.

So you could just go through each SVN file in whatever the SVN database
is (I don't know how SVN organizes it), generate every revision for that
file, and pass in the SVN _database_ filename, rather than necessarily
the filename that that revision is actually associated with when checked
out.

So for example, if SVN were to use the same kind of "Attic/filename,v"
format that CVS uses, there's no reason to worry what the real filename
was in any particular checked-out tree, you could just pass
git-pack-objects a series of lines in the form of

	..
	<sha1-object-name-of-rev1> Attic/filename,v
	<sha1-object-name-of-rev2> Attic/filename,v
	<sha1-object-name-of-rev3> Attic/filename,v
	..

as input on its stdin, and it will create a pack-file of all the objects
you name, and use the "Attic/filename,v" info as the deltifier hint to
know to do all the deltas of those revs against each other rather than
against random other objects.
The fact that the file was actually checked out as "src/filename" (and,
since SVN supports renaming, it might have been checked out under any
number of _other_ names over the history of the project) doesn't matter,
and you don't need to even try to figure that out. git-pack-objects
wouldn't care anyway.

		Linus
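[Editorial aside, not part of the original thread: the path hint works
because the packer only compares objects whose path key matches, so any
stable per-file token groups delta candidates correctly. A sketch of that
grouping idea — the plain-string grouping here is illustrative, not
git's actual internal name-hash:]

```python
from collections import defaultdict

def group_by_hint(entries):
    """Group (sha1, path_hint) pairs so that delta search can be
    restricted to objects sharing a hint. The hint never needs to
    be a real checked-out path, just stable per source file."""
    groups = defaultdict(list)
    for sha1, hint in entries:
        groups[hint].append(sha1)
    return dict(groups)
```

[Revisions of "Attic/foo.c,v" all land in one bucket regardless of what
the file was ever called in a checkout, which is exactly the property
Linus describes.]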
* Re: Creating objects manually and repack

From: Jon Smirl @ 2006-08-04 5:11 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git

On 8/4/06, Linus Torvalds <torvalds@osdl.org> wrote:
> On Thu, 3 Aug 2006, Linus Torvalds wrote:
> >
> > Sure. In that case, just list the object ID's in the exact same order
> > you created them.
>
> Btw, you still want to give a filename for each object you've created, so

I'll add a file name hint.

I'm converting the cvs2svn tool to do cvs2git. Martin has a copy of it up
under git; I haven't checked in any of my changes yet.

	http://git.catalyst.net.nz/gitweb?p=cvs2svn.git;a=summary

If you read the log it is obvious that these guys have done major work to
deal with all kinds of broken CVS repositories. I want to piggyback on
that work and reuse their code that builds change sets. So far this is
the only tool I have found that can import the Mozilla CVS without
errors. The only problem is that it imports to SVN instead of git. I'm
fixing that and learning Python at the same time.

-- 
Jon Smirl
jonsmirl@gmail.com
* Re: Creating objects manually and repack

From: Jon Smirl @ 2006-08-04 14:40 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git

One thing is obvious: I need to tune the repacks to happen before things
spill out of the cache. git repack-objects has been chugging away for
2hrs now at 2% CPU and 3000 io/sec. It is in one of those modes where it
went back to get the early stuff, and in the process of getting that it
knocked the later stuff out of the cache, basically rendering the cache
useless.

I'm making good progress with this. I have hit two bugs in cvs2svn that I
will need to get fixed. cvs2svn is claiming two of the ,v files to be
invalid, but to my eyes they look ok.

-- 
Jon Smirl
jonsmirl@gmail.com
* Re: Creating objects manually and repack

From: Jon Smirl @ 2006-08-04 14:50 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git

The whole problem with CVS import is avoiding getting IO bound. Since
Mozilla CVS expands into 20GB when the revisions are separated out, doing
all that IO takes a lot of time. When these imports take four days it is
all IO time, not CPU.

Could repack-objects be modified to take the objects on stdin as I
generate them, instead of me putting them into the file system and then
deleting them? That model would avoid many gigabytes of IO.

It might work to just stream the output from zlib into repack-objects and
let it recompute the object name. Or could I just stream in the
uncompressed objects? I can still compute the object sha name in my code
so that I can find it later.

-- 
Jon Smirl
jonsmirl@gmail.com
* Re: Creating objects manually and repack

From: Linus Torvalds @ 2006-08-04 15:22 UTC (permalink / raw)
To: Jon Smirl; +Cc: git

On Fri, 4 Aug 2006, Jon Smirl wrote:
>
> Could repack-objects be modified to take the objects on stdin as I
> generate them instead of me putting them into the file system and then
> deleting them? That model would avoid many gigabytes of IO.

I'd suggest against it, but you can (and should) just repack often enough
that you shouldn't ever have gigabytes of objects "in flight". I'd have
expected that with a repack every few ten thousand files, and most files
being on the order of a few kB, you'd have been more than ok, but
especially if you have large files, you may want to make things "every
<n> bytes" rather than "every <n> files".

You _could_ also decide to create packs very aggressively indeed, and if
you do them quickly enough, the raw objects never even get written back
to disk before you delete them. That will leave you with a lot of packs,
but you could then "repack the packs" every once in a while.

That said, it's obviously not _impossible_ to do what you suggest, it's
just major surgery to pack-objects (which I'm not going to have time to
do, since I'll be going on a vacation this weekend).

		Linus
* Re: Creating objects manually and repack

From: Jon Smirl @ 2006-08-04 15:41 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git

On 8/4/06, Linus Torvalds <torvalds@osdl.org> wrote:
> I'd suggest against it, but you can (and should) just repack often
> enough that you shouldn't ever have gigabytes of objects "in flight".
> I'd have expected that with a repack every few ten thousand files, and
> most files being on the order of a few kB, you'd have been more than
> ok, but especially if you have large files, you may want to make things
> "every <n> bytes" rather than "every <n> files".

How about forking off a pack-objects and handing it one file name at a
time over a pipe. When I hand it the next file name I delete the first
file. Does pack-objects make multiple passes over the files? This model
would let me hand it all 1M files.

-- 
Jon Smirl
jonsmirl@gmail.com
* Re: Creating objects manually and repack

From: A Large Angry SCM @ 2006-08-04 16:01 UTC (permalink / raw)
To: git; +Cc: Jon Smirl, Linus Torvalds

Jon Smirl wrote:
> On 8/4/06, Linus Torvalds <torvalds@osdl.org> wrote:
>> I'd suggest against it, but you can (and should) just repack often
>> enough that you shouldn't ever have gigabytes of objects "in flight".
>> I'd have expected that with a repack every few ten thousand files, and
>> most files being on the order of a few kB, you'd have been more than
>> ok, but especially if you have large files, you may want to make
>> things "every <n> bytes" rather than "every <n> files".
>
> How about forking off a pack-objects and handing it one file name at a
> time over a pipe. When I hand it the next file name I delete the first
> file. Does pack-objects make multiple passes over the files? This
> model would let me hand it all 1M files.

Why don't you just write the pack file directly? Pack files without
deltas have a very simple structure, and git-index-pack will create a
pack index file for the pack file you give it.
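[Editorial aside, not part of the original thread: the structure referred
to above really is simple — a 12-byte header ("PACK", version, object
count), one entry per object (a type/size varint header followed by the
zlib-deflated contents), and a trailing SHA-1 over everything before it.
A hedged Python sketch that writes an undeltified pack of blobs:]

```python
import hashlib
import struct
import zlib

OBJ_BLOB = 3  # pack-format type code for a blob

def encode_header(obj_type, size):
    """Pack entry header: first byte carries the type and the low
    4 bits of the size; remaining size bits follow 7 at a time,
    MSB set on every byte except the last."""
    byte = (obj_type << 4) | (size & 0x0F)
    size >>= 4
    out = bytearray()
    while size:
        out.append(byte | 0x80)
        byte = size & 0x7F
        size >>= 7
    out.append(byte)
    return bytes(out)

def write_pack(blobs):
    """Build an undeltified pack file containing the given blob
    contents, with the trailing SHA-1 checksum."""
    body = b"PACK" + struct.pack(">II", 2, len(blobs))
    for data in blobs:
        body += encode_header(OBJ_BLOB, len(data)) + zlib.compress(data)
    return body + hashlib.sha1(body).digest()
```

[As the thread goes on to note, such a pack is correct but large: every
revision is stored whole, which is why deltifying early wins.]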
* Re: Creating objects manually and repack

From: Jon Smirl @ 2006-08-04 16:11 UTC (permalink / raw)
To: gitzilla; +Cc: git, Linus Torvalds

On 8/4/06, A Large Angry SCM <gitzilla@gmail.com> wrote:
> Jon Smirl wrote:
> > How about forking off a pack-objects and handing it one file name at
> > a time over a pipe. When I hand it the next file name I delete the
> > first file. Does pack-objects make multiple passes over the files?
> > This model would let me hand it all 1M files.
>
> Why don't you just write the pack file directly? Pack files without
> deltas have a very simple structure, and git-index-pack will create a
> pack index file for the pack file you give it.

That is under consideration, but the undeltafied pack is about 12GB and
it takes forever (about a day) to deltafy it. I'm not convinced yet that
an undeltafied pack is any faster than just having the objects in the
directories.

The same data in a deltafied pack is 700MB. That is a tremendous
difference in the amount of IO needed. The strategy has to be to avoid
IO; nothing I am doing is ever CPU bound.

-- 
Jon Smirl
jonsmirl@gmail.com
* Re: Creating objects manually and repack

From: Linus Torvalds @ 2006-08-04 16:32 UTC (permalink / raw)
To: Jon Smirl; +Cc: gitzilla, git

On Fri, 4 Aug 2006, Jon Smirl wrote:
>
> That is under consideration but the undeltafied pack is about 12GB and
> it takes forever (about a day) to deltafy it. I'm not convinced yet
> that an undeltafied pack is any faster than just having the objects in
> the directories.

Yeah, I think it's worth it deltifying things early, as you seem to get
all the object info in the right order anyway (ie you do the revisions
for one file in one go).

		Linus
* Re: Creating objects manually and repack

From: Linus Torvalds @ 2006-08-04 16:56 UTC (permalink / raw)
To: A Large Angry SCM; +Cc: git, Jon Smirl

On Fri, 4 Aug 2006, A Large Angry SCM wrote:
>
> Why don't you just write the pack file directly? Pack files without
> deltas have a very simple structure, and git-index-pack will create a
> pack index file for the pack file you give it.

Pack-files without deltas are really huge. You really really don't want
to do this for some medium-large file that has several thousand
revisions.

The reason you want to generate the deltas early is that then, once
you've generated all the simple and obvious deltas (and within each *,v
file from CVS, they are all simple and obvious), doing a "git repack -a
-d" will be able to re-use the deltas you found, making it a much cheaper
operation.

NOTE! For that "git repack -a -d" to work, you'd obviously only do it at
the very end, when you've tied together all the blobs with trees and
commits (since "git repack" wants to follow the reachability chain).

		Linus
* Re: Creating objects manually and repack

From: Rogan Dawes @ 2006-08-04 16:39 UTC (permalink / raw)
To: Jon Smirl; +Cc: git

Jon Smirl wrote:
> On 8/4/06, Linus Torvalds <torvalds@osdl.org> wrote:
>> I'd suggest against it, but you can (and should) just repack often
>> enough that you shouldn't ever have gigabytes of objects "in flight".
>> I'd have expected that with a repack every few ten thousand files, and
>> most files being on the order of a few kB, you'd have been more than
>> ok, but especially if you have large files, you may want to make
>> things "every <n> bytes" rather than "every <n> files".
>
> How about forking off a pack-objects and handing it one file name at a
> time over a pipe. When I hand it the next file name I delete the first
> file. Does pack-objects make multiple passes over the files? This
> model would let me hand it all 1M files.

I'd imagine that this would not necessarily save you a lot, if you have
to write it to disk and then read it back again. Your only chance here is
if you stay in the buffer cache, and avoid actually writing to disk at
all.

Of course, using a ramdisk/tmpfs for your object directories might be
enough to save you. Just use a symlink to tmpfs for the objects
directory, and leave the pack files on persistent storage.

That doesn't answer your question about how many passes pack-objects
does. Nicolas Pitre should be able to answer that.

Rogan
* Re: Creating objects manually and repack

From: Jon Smirl @ 2006-08-04 16:53 UTC (permalink / raw)
To: Rogan Dawes; +Cc: git

On 8/4/06, Rogan Dawes <discard@dawes.za.net> wrote:
> I'd imagine that this would not necessarily save you a lot, if you have
> to write it to disk, and then read it back again. Your only chance here
> is if you stay in the buffer, and avoid actually writing to disk at
> all.

If I keep creating files, reading them, and then deleting them, then it
is likely that the same blocks are being used over and over. Since the
blocks are reused it will stop the cache thrashing. Some disk writes will
still happen, but that is way better than doing 12GB of unique writes
followed by 12GB of reads. The 24GB of IO is all reads on small files, so
it is seek time limited, since repack does writes in the middle of the
reads.

> Of course, using a ramdisk/tmpfs for your object directories might be
> enough to save you. Just use a symlink to tmpfs for the objects
> directory, and leave the pack files on persistent storage.

The unpacked set of objects is way too big to fit into RAM.
Any scheme using the unpacked objects will spill to disk.

> That doesn't answer your question about how many passes pack-objects
> does. Nicolas Pitre should be able to answer that.
>
> Rogan

-- 
Jon Smirl
jonsmirl@gmail.com
* Re: Creating objects manually and repack

From: Linus Torvalds @ 2006-08-04 16:53 UTC (permalink / raw)
To: Jon Smirl; +Cc: git

On Fri, 4 Aug 2006, Jon Smirl wrote:
>
> How about forking off a pack-objects and handing it one file name at a
> time over a pipe. When I hand it the next file name I delete the first
> file. Does pack-objects make multiple passes over the files? This
> model would let me hand it all 1M files.

pack-objects does actually make several (well, two) passes over the
objects right now, because it first does all the sorting based on object
size/type, and then does the actual deltifying pass.

But doing things one file-name at a time would certainly be fine. You can
even do it with git-pack-objects running in parallel, ie you can do a

	for_each_filename() {
		cvs-generate-objects filename |
			git-pack-objects filename
		rm -rf .git/objects/??/
	}

and then "cvs-generate-objects" should just make sure that it writes the
git object _before_ it actually outputs the object name on stdout.

And if you do it this way, you won't even have to pass any filenames,
since git-pack-objects will only get objects for the same file, and will
do the right thing just sorting them by size.

So in the above kind of setting, the _only_ thing that
cvs-generate-objects needs to do is:

	for_each_rev(file) {
		unsigned char sha1[20];
		unsigned long len;
		void *buf;

		/* unpack the revision into memory */
		buf = cvs_unpack_revision(&len);

		/* Write it out as a git blob file */
		write_sha1_file(buf, len, "blob", sha1);

		/* Free the memory image */
		free(buf);

		/* Tell git-pack-objects the name of the git blob */
		printf("%s\n", sha1_to_hex(sha1));
	}

and you're basically all done.
The above would turn each *,v file into a *-<sha>.pack/*-<sha>.idx file
pair, so you'd have exactly as many pack-files as you have *,v files.

		Linus
* Re: Creating objects manually and repack

From: Jon Smirl @ 2006-08-04 17:17 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git

On 8/4/06, Linus Torvalds <torvalds@osdl.org> wrote:
> and you're basically all done. The above would turn each *,v file into
> a *-<sha>.pack/*-<sha>.idx file pair, so you'd have exactly as many
> pack-files as you have *,v files.

I'll end up with 110,000 pack files. I suspect when I run repack over
that it is going to take 24hrs or more, but maybe not, since everything
may be small enough to run in RAM. We'll also get to see the performance
of repack with 110K open file handles. How is it going to figure out
which file handle contains which objects?

A new tool might help. It would concatenate the pack files (while
adjusting the headers) and then build a single index. No attempt at
searching for deltas.

To initially build a single pack file it looks like I need a version of
repack that works in a single pass over the input files. To make things
simple it would just delete the file when it has finished reading it.
Since I'm passing in the revisions in optimal order, sorting them
probably hurts the pack size. The number of files in flight will be a
function of the pipe buffer size and file names.

I'll work on the tree writing code over the weekend.

-- 
Jon Smirl
jonsmirl@gmail.com
* Re: Creating objects manually and repack

From: Linus Torvalds @ 2006-08-04 17:29 UTC (permalink / raw)
To: Jon Smirl; +Cc: git

On Fri, 4 Aug 2006, Jon Smirl wrote:
>
> I'll end up with 110,000 pack files. I suspect when I run repack over
> that it is going to take 24hrs or more, but maybe not since everything
> may be small enough to run in RAM.

You may definitely want to pack the pack-files together every once in a
while. Doing so is not that hard: just list all the objects in all the
pack-files you want to merge, which in turn is trivial from reading the
index of the pack-files (and then you do want to do the filename,
although you can just use the pack-file name if you want to).

But yeah, it's going to be expensive whatever you do. It's a big repo.

		Linus
* Re: Creating objects manually and repack

From: Linus Torvalds @ 2006-08-04 18:06 UTC (permalink / raw)
To: Jon Smirl; +Cc: git

On Fri, 4 Aug 2006, Linus Torvalds wrote:
>
> You may definitely want to pack the pack-files together every once in a
> while. Doing so is not that hard: just list all the objects in all the
> pack-files you want to merge, which in turn is trivial from reading the
> index of the pack-files (and then you do want to do the filename,
> although you can just use the pack-file name if you want to).

Btw, that index format is actually documented (and it really is _very_
simple) in Documentation/technical/pack-format.txt.

To get a list of all object names in a pack-file, you'd basically do just
something like the appended. So with this (let's call it
"git-list-objects"), you could just do

	for i in $packlist
	do
		git-list-objects $i.idx
	done | git-pack-objects combined-pack

and it would combine all the packs in "$packlist" into one new
"combined-pack-<sha1>" pack.

And no, I didn't actually _test_ any of this, but it looks pretty damn
simple.
		Linus

----

	#include <unistd.h>
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>

	#define CHUNK (100)

	/* minimal helpers so this compiles stand-alone */
	static void die(const char *msg)
	{
		fprintf(stderr, "%s\n", msg);
		exit(1);
	}

	static const char *sha1_to_hex(const unsigned char *sha1)
	{
		static char hex[41];
		int i;
		for (i = 0; i < 20; i++)
			sprintf(hex + 2*i, "%02x", sha1[i]);
		return hex;
	}

	int main(int argc, char **argv)
	{
		static unsigned char buffer[24*CHUNK];
		const char *name = argv[1];
		unsigned int n;
		int fd;
		unsigned int i;

		if (!name)
			die("no filename!");
		fd = open(name, O_RDONLY);
		if (fd < 0)
			perror(name);

		/* throw away the first-level fan-out; its last entry
		 * holds the total object count */
		if (read(fd, buffer, 4*256) != 4*256)
			perror("read fan-out");
		n = (buffer[4*255 + 0] << 24) +
		    (buffer[4*255 + 1] << 16) +
		    (buffer[4*255 + 2] <<  8) +
		    (buffer[4*255 + 3] <<  0);

		for (i = 0; i < n; i += CHUNK) {
			unsigned int j, left = n - i;
			if (left > CHUNK)
				left = CHUNK;
			if (read(fd, buffer, left*24) != left*24)
				perror("read chunk");
			for (j = 0; j < left; j++) {
				const unsigned char *sha1;
				sha1 = buffer + j*24 + 4;
				printf("%s %s\n", sha1_to_hex(sha1), name);
			}
		}
		return 0;
	}
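[Editorial aside, not part of the original thread: the same index walk can
be mirrored in a few lines of Python. The version-1 index layout Linus is
reading is 256 big-endian uint32 fan-out counts (the last one holding the
total object count), followed by one record of 4-byte offset plus 20-byte
SHA-1 per object, with trailing checksums that can be ignored here:]

```python
import struct

def list_idx_objects(idx_bytes):
    """Return the hex object names stored in a version-1 pack
    index: 256 big-endian uint32 fan-out entries, then n records
    of 4-byte offset + 20-byte SHA-1."""
    fanout = struct.unpack(">256I", idx_bytes[:1024])
    n = fanout[255]  # last fan-out slot is the total object count
    names = []
    for j in range(n):
        rec = idx_bytes[1024 + 24 * j : 1024 + 24 * (j + 1)]
        names.append(rec[4:].hex())  # skip the 4-byte offset
    return names
```

[Feeding these names back into a packer is the "repack the packs" trick
described above.]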
* Re: Creating objects manually and repack

From: Junio C Hamano @ 2006-08-04 18:24 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git

Linus Torvalds <torvalds@osdl.org> writes:

> On Fri, 4 Aug 2006, Linus Torvalds wrote:
>>
>> You may definitely want to pack the pack-files together every once in
>> a while. Doing so is not that hard: just list all the objects in all
>> the pack-files you want to merge, which in turn is trivial from
>> reading the index of the pack-files (and then you do want to do the
>> filename, although you can just use the pack-file name if you want
>> to).

That would only work *once*, because the resulting pack would now have
blobs from two or more different files and you cannot tell them apart.

So in order to collapse 110k packs into one, you would pack packs into
one every 330 packs, create trees and commits for connectivity, and run
the final repack -a -d over the result, or something like that, I
suppose...

> Btw, that index format is actually documented (and it really is _very_
> simple) in Documentation/technical/pack-format.txt.
>
> To get a list of all object names in a pack-file, you'd basically do
> just something like the appended.

git-show-index?
* Re: Creating objects manually and repack 2006-08-04 18:24 ` Junio C Hamano @ 2006-08-04 19:20 ` Linus Torvalds 2006-08-04 19:31 ` Carl Worth 0 siblings, 1 reply; 36+ messages in thread From: Linus Torvalds @ 2006-08-04 19:20 UTC (permalink / raw) To: Junio C Hamano; +Cc: git On Fri, 4 Aug 2006, Junio C Hamano wrote: > > That would only work *once*, because the resulting pack would > now have blobs from two or more different files and you cannot > tell them apart. You don't care. You need to keep track of the blob names separately _anyway_: the pack information is not enough to re-create all the revision info. So clearly, to create the tree and commit objects, the cvsimport really needs to keep track of the objects it has created, and what their relationship is, and it needs to do that separately. The pack-file just contains the contents, so that you only ever afterwards need to worry about the 20-byte SHA1, not the actual file itself. > > To get a list of all object names in a pack-file, you'd basically do just > > something like the appended. > > git-show-index? Yeah, that might be good. Linus ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Creating objects manually and repack 2006-08-04 19:20 ` Linus Torvalds @ 2006-08-04 19:31 ` Carl Worth 2006-08-04 19:57 ` Junio C Hamano 0 siblings, 1 reply; 36+ messages in thread From: Carl Worth @ 2006-08-04 19:31 UTC (permalink / raw) To: Linus Torvalds; +Cc: Junio C Hamano, git [-- Attachment #1: Type: text/plain, Size: 314 bytes --] On Fri, 4 Aug 2006 12:20:36 -0700 (PDT), Linus Torvalds wrote: > > > To get a list of all object names in a pack-file, you'd basically do just > > > something like the appended. > > > > git-show-index? > > Yeah, that might be good. That clashes pretty badly with update-index. git-show-pack-index perhaps? -Carl [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Creating objects manually and repack 2006-08-04 19:31 ` Carl Worth @ 2006-08-04 19:57 ` Junio C Hamano 2006-08-04 20:08 ` Carl Worth ` (2 more replies) 0 siblings, 3 replies; 36+ messages in thread From: Junio C Hamano @ 2006-08-04 19:57 UTC (permalink / raw) To: Carl Worth; +Cc: git Carl Worth <cworth@cworth.org> writes: > On Fri, 4 Aug 2006 12:20:36 -0700 (PDT), Linus Torvalds wrote: >> > > To get a list of all object names in a pack-file, you'd basically do just >> > > something like the appended. >> > >> > git-show-index? >> >> Yeah, that might be good. > > That clashes pretty badly with update-index. git-show-pack-index > perhaps? There _already_ is a command called git-show-index, since early July last year ;-). ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Creating objects manually and repack 2006-08-04 19:57 ` Junio C Hamano @ 2006-08-04 20:08 ` Carl Worth 2006-08-04 20:08 ` Carl Worth 2006-08-04 20:12 ` Jakub Narebski 2 siblings, 0 replies; 36+ messages in thread From: Carl Worth @ 2006-08-04 20:08 UTC (permalink / raw) To: Junio C Hamano; +Cc: git [-- Attachment #1: Type: text/plain, Size: 399 bytes --] On Fri, 04 Aug 2006 12:57:25 -0700, Junio C Hamano wrote: > > That clashes pretty badly with update-index. git-show-pack-index > > perhaps? > > There _already_ is a command called git-show-index, since early > July last year ;-). Ah, don't mind me then. Just another one of my typical public displays of ignorance. Time to go back and work harder on memorizing that list of 120+ commands... -Carl [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Creating objects manually and repack 2006-08-04 19:57 ` Junio C Hamano 2006-08-04 20:08 ` Carl Worth @ 2006-08-04 20:08 ` Carl Worth 2006-08-04 20:12 ` Jakub Narebski 2 siblings, 0 replies; 36+ messages in thread From: Carl Worth @ 2006-08-04 20:08 UTC (permalink / raw) To: Junio C Hamano; +Cc: git [-- Attachment #1: Type: text/plain, Size: 394 bytes --] On Fri, 04 Aug 2006 12:57:25 -0700, Junio C Hamano wrote: > > That clashes pretty badly with update-index. git-show-pack-index > > perhaps? > > There _already_ is a command called git-show-index, since early > July last year ;-). Ah, don't mind me then. Just another one of my public displays of incompetence. Time to go back and work harder on memorizing that list of 120+ commands... -Carl [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Creating objects manually and repack 2006-08-04 19:57 ` Junio C Hamano 2006-08-04 20:08 ` Carl Worth 2006-08-04 20:08 ` Carl Worth @ 2006-08-04 20:12 ` Jakub Narebski 2006-08-04 20:30 ` Junio C Hamano 2 siblings, 1 reply; 36+ messages in thread From: Jakub Narebski @ 2006-08-04 20:12 UTC (permalink / raw) To: git Junio C Hamano wrote: > Carl Worth <cworth@cworth.org> writes: > >> On Fri, 4 Aug 2006 12:20:36 -0700 (PDT), Linus Torvalds wrote: >>> > > To get a list of all object names in a pack-file, you'd basically do just >>> > > something like the appended. >>> > >>> > git-show-index? >>> >>> Yeah, that might be good. >> >> That clashes pretty badly with update-index. git-show-pack-index >> perhaps? > > There _already_ is a command called git-show-index, since early > July last year ;-). Perhaps it should be renamed then, for consistency? -- Jakub Narebski Warsaw, Poland ShadeHawk on #git ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Creating objects manually and repack 2006-08-04 20:12 ` Jakub Narebski @ 2006-08-04 20:30 ` Junio C Hamano 2006-08-04 20:37 ` Jakub Narebski 0 siblings, 1 reply; 36+ messages in thread From: Junio C Hamano @ 2006-08-04 20:30 UTC (permalink / raw) To: Jakub Narebski; +Cc: git Jakub Narebski <jnareb@gmail.com> writes: > Junio C Hamano wrote: > >> There _already_ is a command called git-show-index, since early >> July last year ;-). > > Perhaps it should be renamed then, for consistency? There isn't anything to make consistent. We use the word "index" to mean both the dircache and the pack index files. Usually you can tell the two usages apart from context. We could rename it to git-show-pack-index, but the command has primarily been a debugging aid and not for real use, and I was actually thinking about removing it, perhaps until now ;-). ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Creating objects manually and repack 2006-08-04 20:30 ` Junio C Hamano @ 2006-08-04 20:37 ` Jakub Narebski 0 siblings, 0 replies; 36+ messages in thread From: Jakub Narebski @ 2006-08-04 20:37 UTC (permalink / raw) To: git Junio C Hamano wrote: > Jakub Narebski <jnareb@gmail.com> writes: > >> Junio C Hamano wrote: >> >>> There _already_ is a command called git-show-index, since early >>> July last year ;-). >> >> Perhaps it should be renamed then, for consistency? > > There isn't anything to make consistent. We use the word > "index" to mean both the dircache and the pack index files. > Usually you can tell the two usages apart from context. Oops. I should have said, for better readability (to not depend on context), then. -- Jakub Narebski Warsaw, Poland ShadeHawk on #git ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Creating objects manually and repack 2006-08-04 17:17 ` Jon Smirl 2006-08-04 17:29 ` Linus Torvalds @ 2006-08-05 4:15 ` Martin Langhoff 2006-08-05 5:12 ` Jon Smirl 1 sibling, 1 reply; 36+ messages in thread From: Martin Langhoff @ 2006-08-05 4:15 UTC (permalink / raw) To: Jon Smirl; +Cc: Linus Torvalds, git On 8/5/06, Jon Smirl <jonsmirl@gmail.com> wrote: > On 8/4/06, Linus Torvalds <torvalds@osdl.org> wrote: > > and you're basically all done. The above would turn each *,v file into a > > *-<sha>.pack/*-<sha>.idx file pair, so you'd have exactly as many > > pack-files as you have *,v files. > > I'll end up with 110,000 pack files. Then just do it every 100 files, and you'll only have 1,100 pack files, and it'll be fine. > I suspect when I run repack over > that it is going to take 24hrs or more, Probably, but only the initial import has to incur that huge cost. cheers, martin ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Creating objects manually and repack 2006-08-05 4:15 ` Martin Langhoff @ 2006-08-05 5:12 ` Jon Smirl 2006-08-05 5:21 ` Shawn Pearce 0 siblings, 1 reply; 36+ messages in thread From: Jon Smirl @ 2006-08-05 5:12 UTC (permalink / raw) To: Martin Langhoff; +Cc: Linus Torvalds, git On 8/5/06, Martin Langhoff <martin.langhoff@gmail.com> wrote: > On 8/5/06, Jon Smirl <jonsmirl@gmail.com> wrote: > > On 8/4/06, Linus Torvalds <torvalds@osdl.org> wrote: > > > and you're basically all done. The above would turn each *,v file into a > > > *-<sha>.pack/*-<sha>.idx file pair, so you'd have exactly as many > > > pack-files as you have *,v files. > > > > I'll end up with 110,000 pack files. > > Then just do it every 100 files, and you'll only have 1,100 pack > files, and it'll be fine. This is something that has to be tuned. If you wait too long everything spills out of RAM and you go totally IO bound for days. If you do it too often you end up with too many packs and it takes a day to repack them. If I had a way to pipe all of the objects into repack one at a time without repack doing multiple passes, none of this tuning would be necessary. In this model the standalone objects never get created in the first place. The fastest IO is IO that has been eliminated. > > I suspect when I run repack over > > that it is going to take 24hrs or more, > > Probably, but only the initial import has to incur that huge cost. Mozilla developers aren't all rushing to switch to git. A switch needs to be as painless as possible. If things are too complex they simply won't switch. Switching Mozilla to git is going to require a sales job and proof that the tools are reliable and better than CVS. Right now I can't even reliably import Mozilla CVS. One of the conditions for even considering git is that they can easily do the CVS import internally and verify it for accuracy. -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Creating objects manually and repack 2006-08-05 5:12 ` Jon Smirl @ 2006-08-05 5:21 ` Shawn Pearce 2006-08-05 5:40 ` Jon Smirl 2006-08-05 5:46 ` Shawn Pearce 0 siblings, 2 replies; 36+ messages in thread From: Shawn Pearce @ 2006-08-05 5:21 UTC (permalink / raw) To: Jon Smirl; +Cc: Martin Langhoff, Linus Torvalds, git Jon Smirl <jonsmirl@gmail.com> wrote: > On 8/5/06, Martin Langhoff <martin.langhoff@gmail.com> wrote: > >On 8/5/06, Jon Smirl <jonsmirl@gmail.com> wrote: > >> On 8/4/06, Linus Torvalds <torvalds@osdl.org> wrote: > >> > and you're basically all done. The above would turn each *,v file into > >a > >> > *-<sha>.pack/*-<sha>.idx file pair, so you'd have exactly as many > >> > pack-files as you have *,v files. > >> > >> I'll end up with 110,000 pack files. > > > >Then just do it every 100 files, and you'll only have 1,100 pack > >files, and it'll be fine. > > This is something that has to be tuned. If you wait too long > everything spills out of RAM and you go totally IO bound for days. If > you do it too often you end up with too many packs and it takes a day > to repack them. > > If I had a way to pipe all of the objects into repack one at a > time without repack doing multiple passes, none of this tuning would be > necessary. In this model the standalone objects never get created in > the first place. The fastest IO is IO that has been eliminated. I'm almost done with what I'm calling `git-fast-import`. It takes a stream of blobs on STDIN and writes the pack to a file, printing SHA1s in hex format to STDOUT. The basic format for STDIN is a 4 byte length (native format) followed by that many bytes of blob data. It prints the SHA1 for that blob to STDOUT, then waits for another length. It naively deltas each object against the prior object; thus it would be best to feed it one ,v file at a time, working from the most recent revision back to the oldest revision. This works well for an RCS file, as that's the natural order to process the file in. :-)
When done you close STDIN and it'll rip through and update the pack object count and the trailing checksum. This should let you pack the entire repository in delta format using only two passes over the data: one to write out the pack file and one to compute its checksum. I'll post the code in a couple of hours. -- Shawn. ^ permalink raw reply [flat|nested] 36+ messages in thread
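The input format described above (a 4-byte native-order length, then that many raw blob bytes, repeated until EOF) is trivial to generate from any language. As a hypothetical illustration, not code from this thread, a Python encoder for that stream might look like:

```python
import struct

def emit_blob_stream(blobs):
    """Encode an iterable of byte strings in the described
    git-fast-import input format: a 4-byte native-endian length,
    then the raw blob bytes, repeated."""
    out = bytearray()
    for blob in blobs:
        out += struct.pack("=I", len(blob))  # native order, like Perl's pack('L')
        out += blob
    return bytes(out)
```

Writing such a stream to the importer's STDIN and reading one hex SHA1 line per blob from its STDOUT would drive it as described, with the feeder choosing the revision order.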
* Re: Creating objects manually and repack 2006-08-05 5:21 ` Shawn Pearce @ 2006-08-05 5:40 ` Jon Smirl 2006-08-05 5:52 ` Shawn Pearce 2006-08-05 5:46 ` Shawn Pearce 1 sibling, 1 reply; 36+ messages in thread From: Jon Smirl @ 2006-08-05 5:40 UTC (permalink / raw) To: Shawn Pearce; +Cc: Martin Langhoff, Linus Torvalds, git On 8/5/06, Shawn Pearce <spearce@spearce.org> wrote: > I'm almost done with what I'm calling `git-fast-import`. It takes > a stream of blobs on STDIN and writes the pack to a file, printing > SHA1s in hex format to STDOUT. The basic format for STDIN is a 4 > byte length (native format) followed by that many bytes of blob data. > It prints the SHA1 for that blob to STDOUT, then waits for another > length. > > It naively deltas each object against the prior object, thus it > would be best to feed it one ,v file at a time working from the most > recent revision back to the oldest revision. This works well for > an RCS file as that's the natural order to process the file in. :-) I am already doing this. > When done you close STDIN and it'll rip through and update the pack > object count and the trailing checksum. This should let you pack > the entire repository in delta format using only two passes over the > data: one to write out the pack file and one to compute its checksum. Thinking about this some more, the existing repack code could be made to work with minor changes. I would like to feed repack 1M revisions which are sorted by file and then newest to oldest. The problem is that my expanded revs take up 12GB disk space. How about adding a flag to repack that simply says delete the objects when done with them? I'd still create all of the objects on disk. Repack would assume that they have at least been sorted by filename. So repack could read in object names until it sees a change in the file name, sort them by size, deltafy, write out the pack and then delete the objects from that batch. Then repeat this process for the next file name on stdin. 
I'm making two assumptions: first, that blocks from a deleted file don't get written to disk, and second, that by deleting the file the file system will use the same blocks over and over. If those assumptions are close to being true then the cache shouldn't thrash. They don't have to be totally true; close is good enough. Of course eliminating the files altogether will be even faster. -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Creating objects manually and repack 2006-08-05 5:40 ` Jon Smirl @ 2006-08-05 5:52 ` Shawn Pearce 0 siblings, 0 replies; 36+ messages in thread From: Shawn Pearce @ 2006-08-05 5:52 UTC (permalink / raw) To: Jon Smirl; +Cc: Martin Langhoff, Linus Torvalds, git Jon Smirl <jonsmirl@gmail.com> wrote: > How about adding a flag to repack that simply says delete the objects > when done with them? I'd still create all of the objects on disk. > Repack would assume that they have at least been sorted by filename. > So repack could read in object names until it sees a change in the > file name, sort them by size, deltafy, write out the pack and then > delete the objects from that batch. Then repeat this process for the > next file name on stdin. > > I'm making two assumptions, first that blocks from a deleted file > don't get written to disk. And that by deleting the file the file > system will use the same blocks over and over. If those assumptions > are close to being true then the cache shouldn't thrash. They don't > have to be totally true, close is good enough. > > Of course eliminating the files all together will be even faster. See the email I just sent you. The only file being written is the pack file that's being generated. No temporary files, no temporary inodes, no temporary blocks. Only two passes over the data: one to write it out and a second to generate the SHA1. I do two passes vs. keep it all in memory to prevent the program from blowing out on extremely large inputs. It may be possible to tweak git-pack-objects to get what you propose above, but to be honest I think the git-fast-import I just sent was easier, especially as it avoids the temporary loose object stage. -- Shawn. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Creating objects manually and repack 2006-08-05 5:21 ` Shawn Pearce 2006-08-05 5:40 ` Jon Smirl @ 2006-08-05 5:46 ` Shawn Pearce 1 sibling, 0 replies; 36+ messages in thread From: Shawn Pearce @ 2006-08-05 5:46 UTC (permalink / raw) To: Jon Smirl; +Cc: Martin Langhoff, Linus Torvalds, git [-- Attachment #1: Type: text/plain, Size: 2432 bytes --] Shawn Pearce <spearce@spearce.org> wrote: > Jon Smirl <jonsmirl@gmail.com> wrote: > > On 8/5/06, Martin Langhoff <martin.langhoff@gmail.com> wrote: > > >On 8/5/06, Jon Smirl <jonsmirl@gmail.com> wrote: > > >> On 8/4/06, Linus Torvalds <torvalds@osdl.org> wrote: > > >> > and you're basically all done. The above would turn each *,v file into > > >a > > >> > *-<sha>.pack/*-<sha>.idx file pair, so you'd have exactly as many > > >> > pack-files as you have *,v files. > > >> > > >> I'll end up with 110,000 pack files. > > > > > >Then just do it every 100 files, and you'll only have 1,100 pack > > >files, and it'll be fine. > > > > This is something that has to be tuned. If you wait too long > > everything spills out of RAM and you go totally IO bound for days. If > > you do it too often you end up with too many packs and it takes a day > > to repack them. > > > > If I had a way to pipe all of the objects into repack one at a > > time without repack doing multiple passes, none of this tuning would be > > necessary. In this model the standalone objects never get created in > > the first place. The fastest IO is IO that has been eliminated. > > I'm almost done with what I'm calling `git-fast-import`. OK, now I'm done. I'm attaching the code. Toss it into the Makefile as git-fast-import and recompile. I tested it with the following Perl script, feeding the Perl script a list of files that I wanted blobs for on STDIN:

	while (<>) {
		chop;
		print pack('L', -s $_);
		open(F, $_);
		my $buf;
		print $buf while read(F,$buf,128*1024) > 0;
		close F;
	}

This gave me an execution order of:

	find . -name '*.c' | perl test.pl | git-fast-import in.pack
	git-index-pack in.pack

at which point in.pack claims to be a completely valid pack with an index of in.idx. Move these into .git/objects/pack, generate trees and commits, and run git-repack -a -d. If the order you feed the objects to git-fast-import in is reasonable (do one RCS file at a time, feed most recent to least recent revisions) you may not get any major benefit from using -f during your final repack. The code for git-fast-import could probably be tweaked to accept trees and commits too, which would permit you to stream the entire CVS repository into a single pack file. :-) I can't help you decompress the RCS files faster, but hopefully this will help you generate the GIT pack faster. Hopefully you can make use of it! -- Shawn.

[-- Attachment #2: fast-import.c --] [-- Type: text/x-csrc, Size: 4659 bytes --]

#include "builtin.h"
#include "cache.h"
#include "object.h"
#include "blob.h"
#include "delta.h"
#include "pack.h"
#include "csum-file.h"

static int max_depth = 10;
static unsigned long object_count;
static int packfd;
static int current_depth;
static void *lastdat;
static unsigned long lastdatlen;
static unsigned char lastsha1[20];

static ssize_t yread(int fd, void *buffer, size_t length)
{
	ssize_t ret = 0;
	while (ret < length) {
		ssize_t size = xread(fd, (char *) buffer + ret, length - ret);
		if (size < 0) {
			return size;
		}
		if (size == 0) {
			return ret;
		}
		ret += size;
	}
	return ret;
}

static ssize_t ywrite(int fd, void *buffer, size_t length)
{
	ssize_t ret = 0;
	while (ret < length) {
		ssize_t size = xwrite(fd, (char *) buffer + ret, length - ret);
		if (size < 0) {
			return size;
		}
		if (size == 0) {
			return ret;
		}
		ret += size;
	}
	return ret;
}

static unsigned long encode_header(
	enum object_type type,
	unsigned long size,
	unsigned char *hdr)
{
	int n = 1;
	unsigned char c;

	if (type < OBJ_COMMIT || type > OBJ_DELTA)
		die("bad type %d", type);

	c = (type << 4) | (size & 15);
	size >>= 4;
	while (size) {
		*hdr++ = c | 0x80;
		c = size & 0x7f;
		size >>= 7;
		n++;
	}
	*hdr = c;
	return n;
}

static void write_blob (void *dat, unsigned long datlen)
{
	z_stream s;
	void *out, *delta;
	unsigned char hdr[64];
	unsigned long hdrlen, deltalen;

	if (lastdat && current_depth < max_depth) {
		delta = diff_delta(lastdat, lastdatlen,
			dat, datlen,
			&deltalen, 0);
	} else
		delta = 0;

	memset(&s, 0, sizeof(s));
	deflateInit(&s, zlib_compression_level);

	if (delta) {
		current_depth++;
		s.next_in = delta;
		s.avail_in = deltalen;
		hdrlen = encode_header(OBJ_DELTA, deltalen, hdr);
		if (ywrite(packfd, hdr, hdrlen) != hdrlen)
			die("Can't write object header: %s", strerror(errno));
		if (ywrite(packfd, lastsha1, sizeof(lastsha1)) != sizeof(lastsha1))
			die("Can't write object base: %s", strerror(errno));
	} else {
		current_depth = 0;
		s.next_in = dat;
		s.avail_in = datlen;
		hdrlen = encode_header(OBJ_BLOB, datlen, hdr);
		if (ywrite(packfd, hdr, hdrlen) != hdrlen)
			die("Can't write object header: %s", strerror(errno));
	}

	s.avail_out = deflateBound(&s, s.avail_in);
	s.next_out = out = xmalloc(s.avail_out);
	while (deflate(&s, Z_FINISH) == Z_OK)
		/* nothing */;
	deflateEnd(&s);

	if (ywrite(packfd, out, s.total_out) != s.total_out)
		die("Failed writing compressed data %s", strerror(errno));

	free(out);
	if (delta)
		free(delta);
}

static void init_pack_header ()
{
	const char* magic = "PACK";
	unsigned long version = 2;
	unsigned long zero = 0;

	version = htonl(version);

	if (ywrite(packfd, (char*)magic, 4) != 4)
		die("Can't write pack magic: %s", strerror(errno));
	if (ywrite(packfd, &version, 4) != 4)
		die("Can't write pack version: %s", strerror(errno));
	if (ywrite(packfd, &zero, 4) != 4)
		die("Can't write 0 object count: %s", strerror(errno));
}

static void fixup_header_footer ()
{
	SHA_CTX c;
	char hdr[8];
	unsigned char sha1[20];
	unsigned long cnt;
	char *buf;
	size_t n;

	if (lseek(packfd, 0, SEEK_SET) != 0)
		die("Failed seeking to start: %s", strerror(errno));

	SHA1_Init(&c);
	if (yread(packfd, hdr, 8) != 8)
		die("Failed reading header: %s", strerror(errno));
	SHA1_Update(&c, hdr, 8);

	fprintf(stderr, "%lu objects\n", object_count);
	cnt = htonl(object_count);
	SHA1_Update(&c, &cnt, 4);
	if (ywrite(packfd, &cnt, 4) != 4)
		die("Failed writing object count: %s", strerror(errno));

	buf = xmalloc(128 * 1024);
	for (;;) {
		n = xread(packfd, buf, 128 * 1024);
		if (n <= 0)
			break;
		SHA1_Update(&c, buf, n);
	}
	free(buf);

	SHA1_Final(sha1, &c);
	if (ywrite(packfd, sha1, sizeof(sha1)) != sizeof(sha1))
		die("Failed writing pack checksum: %s", strerror(errno));
}

int main (int argc, const char **argv)
{
	packfd = open(argv[1], O_RDWR|O_CREAT|O_TRUNC, 0666);
	if (packfd < 0)
		die("Can't create pack file %s: %s", argv[1], strerror(errno));

	init_pack_header();
	for (;;) {
		unsigned long datlen;
		int hdrlen;
		void *dat;
		char hdr[128];
		unsigned char sha1[20];
		SHA_CTX c;

		if (yread(0, &datlen, 4) != 4)
			break;

		dat = xmalloc(datlen);
		if (yread(0, dat, datlen) != datlen)
			break;

		hdrlen = sprintf(hdr, "blob %lu", datlen) + 1;
		SHA1_Init(&c);
		SHA1_Update(&c, hdr, hdrlen);
		SHA1_Update(&c, dat, datlen);
		SHA1_Final(sha1, &c);

		write_blob(dat, datlen);
		object_count++;
		printf("%s\n", sha1_to_hex(sha1));
		fflush(stdout);

		if (lastdat)
			free(lastdat);
		lastdat = dat;
		lastdatlen = datlen;
		memcpy(lastsha1, sha1, sizeof(sha1));
	}
	fixup_header_footer();
	close(packfd);
	return 0;
}

^ permalink raw reply	[flat|nested] 36+ messages in thread
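A note on the hashing in main() above: the SHA1 printed per blob is the standard git object name, computed over the header "blob <length>" plus a NUL byte, followed by the raw data. That construction can be checked independently of git; a small Python sketch (the empty-blob value asserted in passing is git's well-known constant e69de29bb2d1d6434b8b29ae775ad8c2e48c5391):

```python
import hashlib

def git_blob_sha1(data):
    """Name a blob the way git (and the fast-import code above) does:
    SHA-1 over the header 'blob <len>' + NUL, followed by the raw bytes."""
    h = hashlib.sha1()
    h.update(b"blob %d\x00" % len(data))  # same bytes as sprintf("blob %lu") + the trailing NUL
    h.update(data)
    return h.hexdigest()
```

This is why the C code passes hdrlen = sprintf(...) + 1: the NUL terminator is part of the hashed header.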
end of thread, other threads:[~2006-08-05 5:52 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-08-04  3:43 Creating objects manually and repack Jon Smirl
2006-08-04  3:58 ` Jeff King
2006-08-04  4:01 ` Linus Torvalds
2006-08-04  4:24 ` Jon Smirl
2006-08-04  4:46 ` Linus Torvalds
2006-08-04  5:01 ` Linus Torvalds
2006-08-04  5:11 ` Jon Smirl
2006-08-04 14:40 ` Jon Smirl
2006-08-04 14:50 ` Jon Smirl
2006-08-04 15:22 ` Linus Torvalds
2006-08-04 15:41 ` Jon Smirl
2006-08-04 16:01 ` A Large Angry SCM
2006-08-04 16:11 ` Jon Smirl
2006-08-04 16:32 ` Linus Torvalds
2006-08-04 16:56 ` Linus Torvalds
2006-08-04 16:39 ` Rogan Dawes
2006-08-04 16:53 ` Jon Smirl
2006-08-04 16:53 ` Linus Torvalds
2006-08-04 17:17 ` Jon Smirl
2006-08-04 17:29 ` Linus Torvalds
2006-08-04 18:06 ` Linus Torvalds
2006-08-04 18:24 ` Junio C Hamano
2006-08-04 19:20 ` Linus Torvalds
2006-08-04 19:31 ` Carl Worth
2006-08-04 19:57 ` Junio C Hamano
2006-08-04 20:08 ` Carl Worth
2006-08-04 20:08 ` Carl Worth
2006-08-04 20:12 ` Jakub Narebski
2006-08-04 20:30 ` Junio C Hamano
2006-08-04 20:37 ` Jakub Narebski
2006-08-05  4:15 ` Martin Langhoff
2006-08-05  5:12 ` Jon Smirl
2006-08-05  5:21 ` Shawn Pearce
2006-08-05  5:40 ` Jon Smirl
2006-08-05  5:52 ` Shawn Pearce
2006-08-05  5:46 ` Shawn Pearce
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).