* Creating objects manually and repack

From: Jon Smirl @ 2006-08-04 3:43 UTC (permalink / raw)
To: git

I've made 500K object files with my cvs2svn front end. This is 500K of
revision files and no tree files. Now I run git-repack. It says done
counting zero objects. What needs to be updated so that repack will find
all of my objects?

git-fsck isn't happy either since I have no HEAD.

-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Creating objects manually and repack

From: Jeff King @ 2006-08-04 3:58 UTC (permalink / raw)
To: Jon Smirl; +Cc: git

On Thu, Aug 03, 2006 at 11:43:42PM -0400, Jon Smirl wrote:

> I've made 500K object files with my cvs2svn front end. This is 500K of
> revision files and no tree files. Now I run git-repack. It says done
> counting zero objects. What needs to be updated so that repack will
> find all of my objects?

git-repack starts at your heads and works its way down. You can either:

 - make a dummy commit for a tree with all of your blobs
   (git-commit-tree reads the commit message from stdin):

	$ while read sha1; do
	    echo -e "100644 blob $sha1\t$sha1"
	  done <list_of_sha1s | git-update-index --index-info
	$ tree=$(git-write-tree)
	$ commit=$(echo 'import' | git-commit-tree $tree)
	$ git-update-ref HEAD $commit

 - call git-pack-objects directly with a list of objects:

	$ git-pack-objects .git/objects/pack/pack <list_of_sha1s

Obviously the latter is simpler, but the former will also make
git-fsck-objects happy. Note that they're both untested, so there might
be typos.

-Peff
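[Editorial aside, not part of the original thread: both recipes above feed
git 40-hex object names that the importer already wrote as loose objects.
Git names a blob by hashing the header "blob <size>\0" followed by the raw
contents; a minimal Python sketch of that naming rule, so an importer can
compute the names it will later list:]

```python
import hashlib

def git_blob_sha1(data: bytes) -> str:
    """Compute the object name git gives a blob: SHA-1 over the
    header "blob <size>\\0" followed by the raw contents."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()

# The empty blob has the same well-known name in every git repository.
print(git_blob_sha1(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

[One hex name per line on stdin is exactly the list_of_sha1s format the
commands above expect.]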
* Re: Creating objects manually and repack

From: Linus Torvalds @ 2006-08-04 4:01 UTC (permalink / raw)
To: Jon Smirl; +Cc: git

On Thu, 3 Aug 2006, Jon Smirl wrote:
>
> I've made 500K object files with my cvs2svn front end. This is 500K of
> revision files and no tree files. Now I run git-repack. It says done
> counting zero objects. What needs to be updated so that repack will
> find all of my objects?

Just enumerate them by hand, and pass the list off to git-pack-objects.

IOW, you can _literally_ do something like this

	(cd .git/objects ; find . -type f -name '[0-9a-f]*' | tr -d '\./') |
		git-pack-objects tmp-pack

and it will generate a pack-file and index (called "tmp-pack-*.pack" and
"tmp-pack-*.idx" respectively) that contain all your loose objects.

Now, that said, the pack-file will generally _suck_ if you actually do it
like the above. You actually want to pass in the object names _together_
with the filenames they were generated from, so that git-pack-objects can
use its heuristics for finding good delta candidates.

So what you actually want to do is pass in a set of object names, each
with the name of the file it came with (space in between). See for
example the output of

	git-rev-list --objects HEAD^..

for how something like that might look (git-pack-objects is designed to
take the "git-rev-list --objects" output as its input).

> git-fsck isn't happy either since I have no HEAD.

Yeah, you cannot (and mustn't) run anything like git-fsck-objects or
"git prune" until you've connected them all up somehow.

		Linus
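[Editorial aside, not part of the original thread: the find|tr pipeline
above works because loose objects live in a two-level fan-out layout —
the first two hex digits of the name are a subdirectory, the remaining 38
are the file name. A small Python sketch of the same enumeration, under
that layout assumption:]

```python
import os

def list_loose_objects(objects_dir):
    """Yield 40-hex object names from a .git/objects directory,
    where object abcd... is stored as objects/ab/cd... ; the
    pack/ and info/ subdirectories are skipped."""
    for fan in sorted(os.listdir(objects_dir)):
        if len(fan) != 2:  # skip pack/, info/
            continue
        subdir = os.path.join(objects_dir, fan)
        for rest in sorted(os.listdir(subdir)):
            yield fan + rest
```

[Printing these names, one per line, into git-pack-objects reproduces the
shell pipeline above.]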
* Re: Creating objects manually and repack

From: Jon Smirl @ 2006-08-04 4:24 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git

I am converting all of the revisions from each CVS file into git objects
the first time the file is parsed. The plan was to run repack after each
file is finished. That way it should be easy to figure out the deltas,
since everything will be a variation on the same file.

So what's the best way to pack these objects, append them to the existing
pack, and then clean everything up for the next file? I am parsing 120K
CVS files containing over 1M revs.

After I get all of the objects written and packed, later code is going to
write out the trees.

-- 
Jon Smirl
jonsmirl@gmail.com
* Re: Creating objects manually and repack

From: Linus Torvalds @ 2006-08-04 4:46 UTC (permalink / raw)
To: Jon Smirl; +Cc: git

On Fri, 4 Aug 2006, Jon Smirl wrote:
>
> I am converting all of the revisions from each CVS file into git
> objects the first time the file is parsed. The plan was to run repack
> after each file is finished. That way it should be easy to figure out
> the deltas since everything will be a variation on the same file.

Sure. In that case, just list the object ID's in the exact same order you
created them.

Basically, as you create them, just keep a list of all ID's you've
created, and every (say) 50,000 objects, just do a

	echo all objects you've created | git-pack-objects new-pack

and then move the new pack into place, and remove all the loose objects
(don't even bother using "git prune" - just basically do something like
"rm -rf .git/objects/??" to get rid of them).

> So what's the best way to pack these objects, append them to the
> existing pack and then clean everything up for the next file? I am
> parsing 120K CVS files containing over 1M revs.

You'll want to repack every once in a while just to not ever have _tons_
of those loose objects around, but if you do it every 50,000 objects,
you'll have just twenty nice pack-files once you're done, containing all
one million objects, and you'll never have had more than ~200 files in
any of the loose object subdirectories.

Of course, you might want to make that "every 50,000 objects" thing
tunable, so that if you don't have a lot of memory for caching, you might
want to do it a bit more often, just to make each repack go faster and
not have tons of IO.
You can then do a _full_ repack to get one big pack, by just listing
every object you ever created (in creation order) to git-pack-objects,
and then you can replace all the twenty (smaller) pack-files with the
resulting single bigger one.

In fact, at that point you no longer even need to worry about "creation
order", since you've basically created all the deltas in the first phase,
and regardless of ordering, when you then repack everything at the end,
it will re-use all the earlier delta information.

		Linus
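[Editorial aside, not part of the original thread: the "repack every
50,000 objects" loop above is just batching. A sketch of the control
flow, with a hypothetical pack_batch callback standing in for the
git-pack-objects invocation plus the loose-object cleanup:]

```python
def flush_in_batches(object_ids, batch_size, pack_batch):
    """Feed object names to pack_batch(list) every batch_size
    objects, preserving creation order within each batch — the
    order the deltifier wants to see them in."""
    batch = []
    for oid in object_ids:
        batch.append(oid)
        if len(batch) == batch_size:
            pack_batch(batch)
            batch = []
    if batch:  # final partial batch
        pack_batch(batch)
```

[With one million objects and batch_size=50000 this yields the twenty
pack-files Linus describes; making batch_size tunable covers the
low-memory case he mentions.]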
* Re: Creating objects manually and repack

From: Linus Torvalds @ 2006-08-04 5:01 UTC (permalink / raw)
To: Jon Smirl; +Cc: git

On Thu, 3 Aug 2006, Linus Torvalds wrote:
>
> Sure. In that case, just list the object ID's in the exact same order
> you created them.

Btw, you still want to give a filename for each object you've created, so
that the delta sorter does the right thing for the packing.

It doesn't have to be a _real_ filename - just make sure that each
revision that comes from the same file has a filename that matches all
the other revisions from that file. What the filename actually _is_
doesn't much matter, and it doesn't have to be the "real" filename that
was associated with that set of revisions, since we'll just end up
hashing it anyway. So it could be some "SVN inode number" for that set
of revisions or something, for all git-pack-objects cares.

So you could just go through each SVN file in whatever the SVN database
is (I don't know how SVN organizes it), generate every revision for that
file, and pass in the SVN _database_ filename, rather than necessarily
the filename that that revision is actually associated with when checked
out.

So for example, if SVN were to use the same kind of "Attic/filename,v"
format that CVS uses, there's no reason to worry what the real filename
was in any particular checked-out tree, you could just pass
git-pack-objects a series of lines in the form of

	..
	<sha1-object-name-of-rev1> Attic/filename,v
	<sha1-object-name-of-rev2> Attic/filename,v
	<sha1-object-name-of-rev3> Attic/filename,v
	..

as input on its stdin, and it will create a pack-file of all the objects
you name, and use the "Attic/filename,v" info as the deltifier hint to
know to do all the deltas of those revs against each other rather than
against random other objects.
The fact that the file was actually checked out as "src/filename" (and,
since SVN supports renaming, it might have been checked out under any
number of _other_ names over the history of the project) doesn't matter,
and you don't need to even try to figure that out. git-pack-objects
wouldn't care anyway.

		Linus
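[Editorial aside, not part of the original thread: the path hint works
because the packer only compares objects whose path key matches, so any
stable per-file token groups delta candidates correctly. A sketch of that
grouping idea — the plain-string grouping here is illustrative, not
git's actual internal name-hash:]

```python
from collections import defaultdict

def group_by_hint(entries):
    """Group (sha1, path_hint) pairs so that delta search can be
    restricted to objects sharing a hint. The hint never needs to
    be a real checked-out path, just stable per source file."""
    groups = defaultdict(list)
    for sha1, hint in entries:
        groups[hint].append(sha1)
    return dict(groups)
```

[Revisions of "Attic/foo.c,v" all land in one bucket regardless of what
the file was ever called in a checkout, which is exactly the property
Linus describes.]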
* Re: Creating objects manually and repack

From: Jon Smirl @ 2006-08-04 5:11 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git

On 8/4/06, Linus Torvalds <torvalds@osdl.org> wrote:
> On Thu, 3 Aug 2006, Linus Torvalds wrote:
> >
> > Sure. In that case, just list the object ID's in the exact same order
> > you created them.
>
> Btw, you still want to give a filename for each object you've created, so

I'll add a file name hint.

I'm converting the cvs2svn tool to do cvs2git. Martin has a copy of it up
under git; I haven't checked in any of my changes yet.

	http://git.catalyst.net.nz/gitweb?p=cvs2svn.git;a=summary

If you read the log it is obvious that these guys have done major work to
deal with all kinds of broken CVS repositories. I want to piggyback on
that work and reuse their code that builds change sets. So far this is
the only tool I have found that can import the Mozilla CVS without
errors. The only problem is that it imports to SVN instead of git. I'm
fixing that and learning Python at the same time.

-- 
Jon Smirl
jonsmirl@gmail.com
* Re: Creating objects manually and repack

From: Jon Smirl @ 2006-08-04 14:40 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git

One thing is obvious: I need to tune the repacks to happen before things
spill out of the cache. git repack-objects has been chugging away for
2hrs now at 2% CPU and 3000 io/sec. It is in one of those modes where it
went back to get the early stuff, and in the process of getting that it
knocked the later stuff out of the cache, basically rendering the cache
useless.

I'm making good progress with this. I have hit two bugs in cvs2svn that I
will need to get fixed. cvs2svn is claiming two of the ,v files to be
invalid, but to my eyes they look ok.

-- 
Jon Smirl
jonsmirl@gmail.com
* Re: Creating objects manually and repack

From: Jon Smirl @ 2006-08-04 14:50 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git

The whole problem with CVS import is avoiding getting IO bound. Since
Mozilla CVS expands into 20GB when the revisions are separated out, doing
all that IO takes a lot of time. When these imports take four days it is
all IO time, not CPU.

Could repack-objects be modified to take the objects on stdin as I
generate them, instead of me putting them into the file system and then
deleting them? That model would avoid many gigabytes of IO.

It might work to just stream the output from zlib into repack-objects and
let it recompute the object name. Or could I just stream in the
uncompressed objects? I can still compute the object sha name in my code
so that I can find it later.

-- 
Jon Smirl
jonsmirl@gmail.com
* Re: Creating objects manually and repack

From: Linus Torvalds @ 2006-08-04 15:22 UTC (permalink / raw)
To: Jon Smirl; +Cc: git

On Fri, 4 Aug 2006, Jon Smirl wrote:
>
> Could repack-objects be modified to take the objects on stdin as I
> generate them instead of me putting them into the file system and then
> deleting them? That model would avoid many gigabytes of IO.

I'd suggest against it, but you can (and should) just repack often enough
that you shouldn't ever have gigabytes of objects "in flight". I'd have
expected that with a repack every few ten thousand files, and most files
being on the order of a few kB, you'd have been more than ok, but
especially if you have large files, you may want to make things "every
<n> bytes" rather than "every <n> files".

You _could_ also decide to create packs very aggressively indeed, and if
you do them quickly enough, the raw objects never even get written back
to disk before you delete them. That will leave you with a lot of packs,
but you could then "repack the packs" every once in a while.

That said, it's obviously not _impossible_ to do what you suggest, it's
just major surgery to pack-objects (which I'm not going to have time to
do, since I'll be going on a vacation this weekend).

		Linus
* Re: Creating objects manually and repack

From: Jon Smirl @ 2006-08-04 15:41 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git

On 8/4/06, Linus Torvalds <torvalds@osdl.org> wrote:
> I'd suggest against it, but you can (and should) just repack often
> enough that you shouldn't ever have gigabytes of objects "in flight".
> I'd have expected that with a repack every few ten thousand files, and
> most files being on the order of a few kB, you'd have been more than
> ok, but especially if you have large files, you may want to make things
> "every <n> bytes" rather than "every <n> files".

How about forking off a pack-objects and handing it one file name at a
time over a pipe. When I hand it the next file name I delete the first
file. Does pack-objects make multiple passes over the files? This model
would let me hand it all 1M files.

-- 
Jon Smirl
jonsmirl@gmail.com
* Re: Creating objects manually and repack

From: A Large Angry SCM @ 2006-08-04 16:01 UTC (permalink / raw)
To: git; +Cc: Jon Smirl, Linus Torvalds

Jon Smirl wrote:
> On 8/4/06, Linus Torvalds <torvalds@osdl.org> wrote:
>> I'd suggest against it, but you can (and should) just repack often
>> enough that you shouldn't ever have gigabytes of objects "in flight".
>> I'd have expected that with a repack every few ten thousand files, and
>> most files being on the order of a few kB, you'd have been more than
>> ok, but especially if you have large files, you may want to make
>> things "every <n> bytes" rather than "every <n> files".
>
> How about forking off a pack-objects and handing it one file name at a
> time over a pipe. When I hand it the next file name I delete the first
> file. Does pack-objects make multiple passes over the files? This
> model would let me hand it all 1M files.

Why don't you just write the pack file directly? Pack files without
deltas have a very simple structure, and git-index-pack will create a
pack index file for the pack file you give it.
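[Editorial aside, not part of the original thread: the structure referred
to above really is simple — a 12-byte header ("PACK", version, object
count), one entry per object (a type/size varint header followed by the
zlib-deflated contents), and a trailing SHA-1 over everything before it.
A hedged Python sketch that writes an undeltified pack of blobs:]

```python
import hashlib
import struct
import zlib

OBJ_BLOB = 3  # pack-format type code for a blob

def encode_header(obj_type, size):
    """Pack entry header: first byte carries the type and the low
    4 bits of the size; remaining size bits follow 7 at a time,
    MSB set on every byte except the last."""
    byte = (obj_type << 4) | (size & 0x0F)
    size >>= 4
    out = bytearray()
    while size:
        out.append(byte | 0x80)
        byte = size & 0x7F
        size >>= 7
    out.append(byte)
    return bytes(out)

def write_pack(blobs):
    """Build an undeltified pack file containing the given blob
    contents, with the trailing SHA-1 checksum."""
    body = b"PACK" + struct.pack(">II", 2, len(blobs))
    for data in blobs:
        body += encode_header(OBJ_BLOB, len(data)) + zlib.compress(data)
    return body + hashlib.sha1(body).digest()
```

[As the thread goes on to note, such a pack is correct but large: every
revision is stored whole, which is why deltifying early wins.]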
* Re: Creating objects manually and repack

From: Jon Smirl @ 2006-08-04 16:11 UTC (permalink / raw)
To: gitzilla; +Cc: git, Linus Torvalds

On 8/4/06, A Large Angry SCM <gitzilla@gmail.com> wrote:
> Jon Smirl wrote:
> > How about forking off a pack-objects and handing it one file name at
> > a time over a pipe. When I hand it the next file name I delete the
> > first file. Does pack-objects make multiple passes over the files?
> > This model would let me hand it all 1M files.
>
> Why don't you just write the pack file directly? Pack files without
> deltas have a very simple structure, and git-index-pack will create a
> pack index file for the pack file you give it.

That is under consideration, but the undeltafied pack is about 12GB and
it takes forever (about a day) to deltafy it. I'm not convinced yet that
an undeltafied pack is any faster than just having the objects in the
directories.

The same data in a deltafied pack is 700MB. That is a tremendous
difference in the amount of IO needed. The strategy has to be to avoid
IO; nothing I am doing is ever CPU bound.

-- 
Jon Smirl
jonsmirl@gmail.com
* Re: Creating objects manually and repack

From: Linus Torvalds @ 2006-08-04 16:32 UTC (permalink / raw)
To: Jon Smirl; +Cc: gitzilla, git

On Fri, 4 Aug 2006, Jon Smirl wrote:
>
> That is under consideration but the undeltafied pack is about 12GB and
> it takes forever (about a day) to deltafy it. I'm not convinced yet
> that an undeltafied pack is any faster than just having the objects in
> the directories.

Yeah, I think it's worth it deltifying things early, as you seem to get
all the object info in the right order anyway (ie you do the revisions
for one file in one go).

		Linus
* Re: Creating objects manually and repack

From: Linus Torvalds @ 2006-08-04 16:56 UTC (permalink / raw)
To: A Large Angry SCM; +Cc: git, Jon Smirl

On Fri, 4 Aug 2006, A Large Angry SCM wrote:
>
> Why don't you just write the pack file directly? Pack files without
> deltas have a very simple structure, and git-index-pack will create a
> pack index file for the pack file you give it.

Pack-files without deltas are really huge. You really really don't want
to do this for some medium-large file that has several thousand
revisions.

The reason you want to generate the deltas early is that then, once
you've generated all the simple and obvious deltas (and within each *,v
file from CVS, they are all simple and obvious), doing a "git repack -a
-d" will be able to re-use the deltas you found, making it a much cheaper
operation.

NOTE! For that "git repack -a -d" to work, you'd obviously only do it at
the very end, when you've tied together all the blobs with trees and
commits (since "git repack" wants to follow the reachability chain).

		Linus
* Re: Creating objects manually and repack

From: Rogan Dawes @ 2006-08-04 16:39 UTC (permalink / raw)
To: Jon Smirl; +Cc: git

Jon Smirl wrote:
> On 8/4/06, Linus Torvalds <torvalds@osdl.org> wrote:
>> I'd suggest against it, but you can (and should) just repack often
>> enough that you shouldn't ever have gigabytes of objects "in flight".
>> I'd have expected that with a repack every few ten thousand files, and
>> most files being on the order of a few kB, you'd have been more than
>> ok, but especially if you have large files, you may want to make
>> things "every <n> bytes" rather than "every <n> files".
>
> How about forking off a pack-objects and handing it one file name at a
> time over a pipe. When I hand it the next file name I delete the first
> file. Does pack-objects make multiple passes over the files? This
> model would let me hand it all 1M files.

I'd imagine that this would not necessarily save you a lot, if you have
to write it to disk and then read it back again. Your only chance here is
if you stay in the buffer cache, and avoid actually writing to disk at
all.

Of course, using a ramdisk/tmpfs for your object directories might be
enough to save you. Just use a symlink to tmpfs for the objects
directory, and leave the pack files on persistent storage.

That doesn't answer your question about how many passes pack-objects
does. Nicolas Pitre should be able to answer that.

Rogan
* Re: Creating objects manually and repack

From: Jon Smirl @ 2006-08-04 16:53 UTC (permalink / raw)
To: Rogan Dawes; +Cc: git

On 8/4/06, Rogan Dawes <discard@dawes.za.net> wrote:
> I'd imagine that this would not necessarily save you a lot, if you have
> to write it to disk, and then read it back again. Your only chance here
> is if you stay in the buffer, and avoid actually writing to disk at
> all.

If I keep creating files, reading them, and then deleting them, then it
is likely that the same blocks are being used over and over. Since the
blocks are reused it will stop the cache thrashing. Some disk writes will
still happen, but that is way better than doing 12GB of unique writes
followed by 12GB of reads. The 24GB of IO is all reads on small files, so
it is seek time limited, since repack does writes in the middle of the
reads.

> Of course, using a ramdisk/tmpfs for your object directories might be
> enough to save you. Just use a symlink to tmpfs for the objects
> directory, and leave the pack files on persistent storage.

The unpacked set of objects is way too big to fit into RAM.
Any scheme using the unpacked objects will spill to disk.

> That doesn't answer your question about how many passes pack-objects
> does. Nicolas Pitre should be able to answer that.
>
> Rogan

-- 
Jon Smirl
jonsmirl@gmail.com
* Re: Creating objects manually and repack

From: Linus Torvalds @ 2006-08-04 16:53 UTC (permalink / raw)
To: Jon Smirl; +Cc: git

On Fri, 4 Aug 2006, Jon Smirl wrote:
>
> How about forking off a pack-objects and handing it one file name at a
> time over a pipe. When I hand it the next file name I delete the first
> file. Does pack-objects make multiple passes over the files? This
> model would let me hand it all 1M files.

pack-objects does actually make several (well, two) passes over the
objects right now, because it first does all the sorting based on object
size/type, and then does the actual deltifying pass.

But doing things one file-name at a time would certainly be fine. You can
even do it with git-pack-objects running in parallel, ie you can do a

	for_each_filename() {
		cvs-generate-objects filename |
			git-pack-objects filename
		rm -rf .git/objects/??/
	}

and then "cvs-generate-objects" should just make sure that it writes the
git object _before_ it actually outputs the object name on stdout.

And if you do it this way, you won't even have to pass any filenames,
since git-pack-objects will only get objects for the same file, and will
do the right thing just sorting them by size.

So in the above kind of setting, the _only_ thing that
cvs-generate-objects needs to do is:

	for_each_rev(file) {
		unsigned char sha1[20];
		unsigned long len;
		void *buf;

		/* unpack the revision into memory */
		buf = cvs_unpack_revision(&len);

		/* Write it out as a git blob file */
		write_sha1_file(buf, len, "blob", sha1);

		/* Free the memory image */
		free(buf);

		/* Tell git-pack-objects the name of the git blob */
		printf("%s\n", sha1_to_hex(sha1));
	}

and you're basically all done.
The above would turn each *,v file into a *-<sha>.pack/*-<sha>.idx file
pair, so you'd have exactly as many pack-files as you have *,v files.

		Linus
* Re: Creating objects manually and repack

From: Jon Smirl @ 2006-08-04 17:17 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git

On 8/4/06, Linus Torvalds <torvalds@osdl.org> wrote:
> and you're basically all done. The above would turn each *,v file into
> a *-<sha>.pack/*-<sha>.idx file pair, so you'd have exactly as many
> pack-files as you have *,v files.

I'll end up with 110,000 pack files. I suspect when I run repack over
that it is going to take 24hrs or more, but maybe not, since everything
may be small enough to run in RAM. We'll also get to see the performance
of repack with 110K open file handles. How is it going to figure out
which file handle contains which objects?

A new tool might help. It would concatenate the pack files (while
adjusting the headers) and then build a single index. No attempt at
searching for deltas.

To initially build a single pack file it looks like I need a version of
repack that works in a single pass over the input files. To make things
simple it would just delete the file when it has finished reading it.
Since I'm passing in the revisions in optimal order, sorting them
probably hurts the pack size. The number of files in flight will be a
function of the pipe buffer size and file names.

I'll work on the tree writing code over the weekend.

-- 
Jon Smirl
jonsmirl@gmail.com
* Re: Creating objects manually and repack

From: Linus Torvalds @ 2006-08-04 17:29 UTC (permalink / raw)
To: Jon Smirl; +Cc: git

On Fri, 4 Aug 2006, Jon Smirl wrote:
>
> I'll end up with 110,000 pack files. I suspect when I run repack over
> that it is going to take 24hrs or more, but maybe not since everything
> may be small enough to run in RAM.

You may definitely want to pack the pack-files together every once in a
while. Doing so is not that hard: just list all the objects in all the
pack-files you want to merge, which in turn is trivial from reading the
index of the pack-files (and then you do want to do the filename,
although you can just use the pack-file name if you want to).

But yeah, it's going to be expensive whatever you do. It's a big repo.

		Linus
* Re: Creating objects manually and repack

From: Linus Torvalds @ 2006-08-04 18:06 UTC (permalink / raw)
To: Jon Smirl; +Cc: git

On Fri, 4 Aug 2006, Linus Torvalds wrote:
>
> You may definitely want to pack the pack-files together every once in a
> while. Doing so is not that hard: just list all the objects in all the
> pack-files you want to merge, which in turn is trivial from reading the
> index of the pack-files (and then you do want to do the filename,
> although you can just use the pack-file name if you want to).

Btw, that index format is actually documented (and it really is _very_
simple) in Documentation/technical/pack-format.txt.

To get a list of all object names in a pack-file, you'd basically do just
something like the appended. So with this (let's call it
"git-list-objects"), you could just do

	for i in $packlist
	do
		git-list-objects $i.idx
	done | git-pack-objects combined-pack

and it would combine all the packs in "$packlist" into one new
"combined-pack-<sha1>" pack.

And no, I didn't actually _test_ any of this, but it looks pretty damn
simple.
		Linus

----

	#include <unistd.h>
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>

	#define CHUNK (100)

	/* minimal helpers so this compiles stand-alone */
	static void die(const char *msg)
	{
		fprintf(stderr, "%s\n", msg);
		exit(1);
	}

	static const char *sha1_to_hex(const unsigned char *sha1)
	{
		static char hex[41];
		int i;
		for (i = 0; i < 20; i++)
			sprintf(hex + 2*i, "%02x", sha1[i]);
		return hex;
	}

	int main(int argc, char **argv)
	{
		static unsigned char buffer[24*CHUNK];
		const char *name = argv[1];
		unsigned int n;
		int fd;
		unsigned int i;

		if (!name)
			die("no filename!");
		fd = open(name, O_RDONLY);
		if (fd < 0)
			perror(name);

		/* throw away the first-level fan-out; its last entry
		 * holds the total object count */
		if (read(fd, buffer, 4*256) != 4*256)
			perror("read fan-out");
		n = (buffer[4*255 + 0] << 24) +
		    (buffer[4*255 + 1] << 16) +
		    (buffer[4*255 + 2] <<  8) +
		    (buffer[4*255 + 3] <<  0);

		for (i = 0; i < n; i += CHUNK) {
			unsigned int j, left = n - i;
			if (left > CHUNK)
				left = CHUNK;
			if (read(fd, buffer, left*24) != left*24)
				perror("read chunk");
			for (j = 0; j < left; j++) {
				const unsigned char *sha1;
				sha1 = buffer + j*24 + 4;
				printf("%s %s\n", sha1_to_hex(sha1), name);
			}
		}
		return 0;
	}
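[Editorial aside, not part of the original thread: the same index walk can
be mirrored in a few lines of Python. The version-1 index layout Linus is
reading is 256 big-endian uint32 fan-out counts (the last one holding the
total object count), followed by one record of 4-byte offset plus 20-byte
SHA-1 per object, with trailing checksums that can be ignored here:]

```python
import struct

def list_idx_objects(idx_bytes):
    """Return the hex object names stored in a version-1 pack
    index: 256 big-endian uint32 fan-out entries, then n records
    of 4-byte offset + 20-byte SHA-1."""
    fanout = struct.unpack(">256I", idx_bytes[:1024])
    n = fanout[255]  # last fan-out slot is the total object count
    names = []
    for j in range(n):
        rec = idx_bytes[1024 + 24 * j : 1024 + 24 * (j + 1)]
        names.append(rec[4:].hex())  # skip the 4-byte offset
    return names
```

[Feeding these names back into a packer is the "repack the packs" trick
described above.]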
* Re: Creating objects manually and repack

From: Junio C Hamano @ 2006-08-04 18:24 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git

Linus Torvalds <torvalds@osdl.org> writes:

> On Fri, 4 Aug 2006, Linus Torvalds wrote:
>>
>> You may definitely want to pack the pack-files together every once in
>> a while. Doing so is not that hard: just list all the objects in all
>> the pack-files you want to merge, which in turn is trivial from
>> reading the index of the pack-files (and then you do want to do the
>> filename, although you can just use the pack-file name if you want
>> to).

That would only work *once*, because the resulting pack would now have
blobs from two or more different files and you cannot tell them apart.

So in order to collapse 110k packs into one, you would pack packs into
one every 330 packs, create trees and commits for connectivity, and run
the final repack -a -d over the result, or something like that, I
suppose...

> Btw, that index format is actually documented (and it really is _very_
> simple) in Documentation/technical/pack-format.txt.
>
> To get a list of all object names in a pack-file, you'd basically do
> just something like the appended.

git-show-index?
* Re: Creating objects manually and repack 2006-08-04 18:24 ` Junio C Hamano @ 2006-08-04 19:20 ` Linus Torvalds 2006-08-04 19:31 ` Carl Worth 0 siblings, 1 reply; 36+ messages in thread From: Linus Torvalds @ 2006-08-04 19:20 UTC (permalink / raw) To: Junio C Hamano; +Cc: git On Fri, 4 Aug 2006, Junio C Hamano wrote: > > That would only work *once*, because the resulting pack would > now have blobs from two or more different files and you cannot > tell them apart. You don't care. You need to keep track of the blob names separately _anyway_: the pack information is not enough to re-create all the revision info. So clearly, to create the tree and commit objects, the cvsimport really needs to keep track of the objects it has created, and what their relationship is, and it needs to do that separately. The pack-file just contains the contents, so that you only ever afterwards need to worry about the 20-byte SHA1, not the actual file itself. > > To get a list of all object names in a pack-file, you'd basically do just > > something like the appended. > > git-show-index? Yeah, that might be good. Linus ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Creating objects manually and repack 2006-08-04 19:20 ` Linus Torvalds @ 2006-08-04 19:31 ` Carl Worth 2006-08-04 19:57 ` Junio C Hamano 0 siblings, 1 reply; 36+ messages in thread From: Carl Worth @ 2006-08-04 19:31 UTC (permalink / raw) To: Linus Torvalds; +Cc: Junio C Hamano, git [-- Attachment #1: Type: text/plain, Size: 314 bytes --] On Fri, 4 Aug 2006 12:20:36 -0700 (PDT), Linus Torvalds wrote: > > > To get a list of all object names in a pack-file, you'd basically do just > > > something like the appended. > > > > git-show-index? > > Yeah, that might be good. That clashes pretty badly with update-index. git-show-pack-index perhaps? -Carl [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Creating objects manually and repack 2006-08-04 19:31 ` Carl Worth @ 2006-08-04 19:57 ` Junio C Hamano 2006-08-04 20:08 ` Carl Worth ` (2 more replies) 0 siblings, 3 replies; 36+ messages in thread From: Junio C Hamano @ 2006-08-04 19:57 UTC (permalink / raw) To: Carl Worth; +Cc: git Carl Worth <cworth@cworth.org> writes: > On Fri, 4 Aug 2006 12:20:36 -0700 (PDT), Linus Torvalds wrote: >> > > To get a list of all object names in a pack-file, you'd basically do just >> > > something like the appended. >> > >> > git-show-index? >> >> Yeah, that might be good. > > That clashes pretty badly with update-index. git-show-pack-index > perhaps? There _already_ is a command called git-show-index, since early July last year ;-). ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Creating objects manually and repack 2006-08-04 19:57 ` Junio C Hamano @ 2006-08-04 20:08 ` Carl Worth 2006-08-04 20:08 ` Carl Worth 2006-08-04 20:12 ` Jakub Narebski 2 siblings, 0 replies; 36+ messages in thread From: Carl Worth @ 2006-08-04 20:08 UTC (permalink / raw) To: Junio C Hamano; +Cc: git [-- Attachment #1: Type: text/plain, Size: 399 bytes --] On Fri, 04 Aug 2006 12:57:25 -0700, Junio C Hamano wrote: > > That clashes pretty badly with update-index. git-show-pack-index > > perhaps? > > There _already_ is a command called git-show-index, since early > July last year ;-). Ah, don't mind me then. Just another one of my typical public displays of ignorance. Time to go back and work harder on memorizing that list of 120+ commands... -Carl [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Creating objects manually and repack 2006-08-04 19:57 ` Junio C Hamano 2006-08-04 20:08 ` Carl Worth @ 2006-08-04 20:08 ` Carl Worth 2006-08-04 20:12 ` Jakub Narebski 2 siblings, 0 replies; 36+ messages in thread From: Carl Worth @ 2006-08-04 20:08 UTC (permalink / raw) To: Junio C Hamano; +Cc: git [-- Attachment #1: Type: text/plain, Size: 394 bytes --] On Fri, 04 Aug 2006 12:57:25 -0700, Junio C Hamano wrote: > > That clashes pretty badly with update-index. git-show-pack-index > > perhaps? > > There _already_ is a command called git-show-index, since early > July last year ;-). Ah, don't mind me then. Just another one of my public displays of incompetence. Time to go back and work harder on memorizing that list of 120+ commands... -Carl [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Creating objects manually and repack 2006-08-04 19:57 ` Junio C Hamano 2006-08-04 20:08 ` Carl Worth 2006-08-04 20:08 ` Carl Worth @ 2006-08-04 20:12 ` Jakub Narebski 2006-08-04 20:30 ` Junio C Hamano 2 siblings, 1 reply; 36+ messages in thread From: Jakub Narebski @ 2006-08-04 20:12 UTC (permalink / raw) To: git Junio C Hamano wrote: > Carl Worth <cworth@cworth.org> writes: > >> On Fri, 4 Aug 2006 12:20:36 -0700 (PDT), Linus Torvalds wrote: >>> > > To get a list of all object names in a pack-file, you'd basically do just >>> > > something like the appended. >>> > >>> > git-show-index? >>> >>> Yeah, that might be good. >> >> That clashes pretty badly with update-index. git-show-pack-index >> perhaps? > > There _already_ is a command called git-show-index, since early > July last year ;-). Perhaps it should be renamed then, for consistency? -- Jakub Narebski Warsaw, Poland ShadeHawk on #git ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Creating objects manually and repack 2006-08-04 20:12 ` Jakub Narebski @ 2006-08-04 20:30 ` Junio C Hamano 2006-08-04 20:37 ` Jakub Narebski 0 siblings, 1 reply; 36+ messages in thread From: Junio C Hamano @ 2006-08-04 20:30 UTC (permalink / raw) To: Jakub Narebski; +Cc: git Jakub Narebski <jnareb@gmail.com> writes: > Junio C Hamano wrote: > >> There _already_ is a command called git-show-index, since early >> July last year ;-). > > Perhaps it should be renamed then, for consistency? There isn't anything to make consistent. We use the word "index" to mean both the dircache and the pack index files. Usually you can tell the two usages apart from context. We could rename it to git-show-pack-index, but the command has primarily been a debugging aid and not for real use, and I was actually thinking about removing it, perhaps until now ;-). ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Creating objects manually and repack 2006-08-04 20:30 ` Junio C Hamano @ 2006-08-04 20:37 ` Jakub Narebski 0 siblings, 0 replies; 36+ messages in thread From: Jakub Narebski @ 2006-08-04 20:37 UTC (permalink / raw) To: git Junio C Hamano wrote: > Jakub Narebski <jnareb@gmail.com> writes: > >> Junio C Hamano wrote: >> >>> There _already_ is a command called git-show-index, since early >>> July last year ;-). >> >> Perhaps it should be renamed then, for consistency? > > There isn't anything to make consistent. We use the word > "index" to mean both the dircache and the pack index files. > Usually you can tell the two usages apart from context. Oops. I should have said, for better readability (to not depend on context), then. -- Jakub Narebski Warsaw, Poland ShadeHawk on #git ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Creating objects manually and repack 2006-08-04 17:17 ` Jon Smirl 2006-08-04 17:29 ` Linus Torvalds @ 2006-08-05 4:15 ` Martin Langhoff 2006-08-05 5:12 ` Jon Smirl 1 sibling, 1 reply; 36+ messages in thread From: Martin Langhoff @ 2006-08-05 4:15 UTC (permalink / raw) To: Jon Smirl; +Cc: Linus Torvalds, git On 8/5/06, Jon Smirl <jonsmirl@gmail.com> wrote: > On 8/4/06, Linus Torvalds <torvalds@osdl.org> wrote: > > and you're basically all done. The above would turn each *,v file into a > > *-<sha>.pack/*-<sha>.idx file pair, so you'd have exactly as many > > pack-files as you have *,v files. > > I'll end up with 110,000 pack files. Then just do it every 100 files, and you'll only have 1,100 pack files, and it'll be fine. > I suspect when I run repack over > that it is going to take 24hrs or more, Probably, but only the initial import has to incur that huge cost. cheers, martin ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Creating objects manually and repack 2006-08-05 4:15 ` Martin Langhoff @ 2006-08-05 5:12 ` Jon Smirl 2006-08-05 5:21 ` Shawn Pearce 0 siblings, 1 reply; 36+ messages in thread From: Jon Smirl @ 2006-08-05 5:12 UTC (permalink / raw) To: Martin Langhoff; +Cc: Linus Torvalds, git On 8/5/06, Martin Langhoff <martin.langhoff@gmail.com> wrote: > On 8/5/06, Jon Smirl <jonsmirl@gmail.com> wrote: > > On 8/4/06, Linus Torvalds <torvalds@osdl.org> wrote: > > > and you're basically all done. The above would turn each *,v file into a > > > *-<sha>.pack/*-<sha>.idx file pair, so you'd have exactly as many > > > pack-files as you have *,v files. > > > > I'll end up with 110,000 pack files. > > Then just do it every 100 files, and you'll only have 1,100 pack > files, and it'll be fine. This is something that has to be tuned. If you wait too long everything spills out of RAM and you go totally IO bound for days. If you do it too often you end up with too many packs and it takes a day to repack them. If I had a way to pipe all of the objects into repack one at a time without repack doing multiple passes, none of this tuning would be necessary. In this model the standalone objects never get created in the first place. The fastest IO is IO that has been eliminated. > > I suspect when I run repack over > > that it is going to take 24hrs or more, > > Probably, but only the initial import has to incur that huge cost. Mozilla developers aren't all rushing to switch to git. A switch needs to be as painless as possible. If things are too complex they simply won't switch. Switching Mozilla to git is going to require a sales job and proof that the tools are reliable and better than CVS. Right now I can't even reliably import Mozilla CVS. One of the conditions for even considering git is that they can easily do the CVS import internally and verify it for accuracy. -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Creating objects manually and repack 2006-08-05 5:12 ` Jon Smirl @ 2006-08-05 5:21 ` Shawn Pearce 2006-08-05 5:40 ` Jon Smirl 2006-08-05 5:46 ` Shawn Pearce 0 siblings, 2 replies; 36+ messages in thread From: Shawn Pearce @ 2006-08-05 5:21 UTC (permalink / raw) To: Jon Smirl; +Cc: Martin Langhoff, Linus Torvalds, git Jon Smirl <jonsmirl@gmail.com> wrote: > On 8/5/06, Martin Langhoff <martin.langhoff@gmail.com> wrote: > >On 8/5/06, Jon Smirl <jonsmirl@gmail.com> wrote: > >> On 8/4/06, Linus Torvalds <torvalds@osdl.org> wrote: > >> > and you're basically all done. The above would turn each *,v file into > >a > >> > *-<sha>.pack/*-<sha>.idx file pair, so you'd have exactly as many > >> > pack-files as you have *,v files. > >> > >> I'll end up with 110,000 pack files. > > > >Then just do it every 100 files, and you'll only have 1,100 pack > >files, and it'll be fine. > > This is something that has to be tuned. If you wait too long > everything spills out of RAM and you go totally IO bound for days. If > you do it too often you end up with too many packs and it takes a day > to repack them. > > If I had a way to pipe all of the objects into repack one at a > time without repack doing multiple passes, none of this tuning would be > necessary. In this model the standalone objects never get created in > the first place. The fastest IO is IO that has been eliminated. I'm almost done with what I'm calling `git-fast-import`. It takes a stream of blobs on STDIN and writes the pack to a file, printing SHA1s in hex format to STDOUT. The basic format for STDIN is a 4 byte length (native format) followed by that many bytes of blob data. It prints the SHA1 for that blob to STDOUT, then waits for another length. It naively deltas each object against the prior object; thus it would be best to feed it one ,v file at a time, working from the most recent revision back to the oldest revision. This works well for an RCS file, as that's the natural order to process the file in. :-)
When done you close STDIN and it'll rip through and update the pack object count and the trailing checksum. This should let you pack the entire repository in delta format using only two passes over the data: one to write out the pack file and one to compute its checksum. I'll post the code in a couple of hours. -- Shawn. ^ permalink raw reply [flat|nested] 36+ messages in thread
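The input format described above (a 4-byte native-order length, then that many raw blob bytes, repeated until EOF) is trivial to generate from any language. As a hypothetical illustration, not code from this thread, a Python encoder for that stream might look like:

```python
import struct

def emit_blob_stream(blobs):
    """Encode an iterable of byte strings in the described
    git-fast-import input format: a 4-byte native-endian length,
    then the raw blob bytes, repeated."""
    out = bytearray()
    for blob in blobs:
        out += struct.pack("=I", len(blob))  # native order, like Perl's pack('L')
        out += blob
    return bytes(out)
```

Writing such a stream to the importer's STDIN and reading one hex SHA1 line per blob from its STDOUT would drive it as described, with the feeder choosing the revision order.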
* Re: Creating objects manually and repack 2006-08-05 5:21 ` Shawn Pearce @ 2006-08-05 5:40 ` Jon Smirl 2006-08-05 5:52 ` Shawn Pearce 2006-08-05 5:46 ` Shawn Pearce 1 sibling, 1 reply; 36+ messages in thread From: Jon Smirl @ 2006-08-05 5:40 UTC (permalink / raw) To: Shawn Pearce; +Cc: Martin Langhoff, Linus Torvalds, git On 8/5/06, Shawn Pearce <spearce@spearce.org> wrote: > I'm almost done with what I'm calling `git-fast-import`. It takes > a stream of blobs on STDIN and writes the pack to a file, printing > SHA1s in hex format to STDOUT. The basic format for STDIN is a 4 > byte length (native format) followed by that many bytes of blob data. > It prints the SHA1 for that blob to STDOUT, then waits for another > length. > > It naively deltas each object against the prior object, thus it > would be best to feed it one ,v file at a time working from the most > recent revision back to the oldest revision. This works well for > an RCS file as that's the natural order to process the file in. :-) I am already doing this. > When done you close STDIN and it'll rip through and update the pack > object count and the trailing checksum. This should let you pack > the entire repository in delta format using only two passes over the > data: one to write out the pack file and one to compute its checksum. Thinking about this some more, the existing repack code could be made to work with minor changes. I would like to feed repack 1M revisions which are sorted by file and then newest to oldest. The problem is that my expanded revs take up 12GB disk space. How about adding a flag to repack that simply says delete the objects when done with them? I'd still create all of the objects on disk. Repack would assume that they have at least been sorted by filename. So repack could read in object names until it sees a change in the file name, sort them by size, deltafy, write out the pack and then delete the objects from that batch. Then repeat this process for the next file name on stdin. 
I'm making two assumptions: first, that blocks from a deleted file don't get written to disk, and second, that by deleting the file the file system will use the same blocks over and over. If those assumptions are close to being true then the cache shouldn't thrash. They don't have to be totally true; close is good enough. Of course eliminating the files altogether will be even faster. -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Creating objects manually and repack 2006-08-05 5:40 ` Jon Smirl @ 2006-08-05 5:52 ` Shawn Pearce 0 siblings, 0 replies; 36+ messages in thread From: Shawn Pearce @ 2006-08-05 5:52 UTC (permalink / raw) To: Jon Smirl; +Cc: Martin Langhoff, Linus Torvalds, git Jon Smirl <jonsmirl@gmail.com> wrote: > How about adding a flag to repack that simply says delete the objects > when done with them? I'd still create all of the objects on disk. > Repack would assume that they have at least been sorted by filename. > So repack could read in object names until it sees a change in the > file name, sort them by size, deltafy, write out the pack and then > delete the objects from that batch. Then repeat this process for the > next file name on stdin. > > I'm making two assumptions, first that blocks from a deleted file > don't get written to disk. And that by deleting the file the file > system will use the same blocks over and over. If those assumptions > are close to being true then the cache shouldn't thrash. They don't > have to be totally true, close is good enough. > > Of course eliminating the files all together will be even faster. See the email I just sent you. The only file being written is the pack file that's being generated. No temporary files, no temporary inodes, no temporary blocks. Only two passes over the data: one to write it out and a second to generate the SHA1. I do two passes vs. keep it all in memory to prevent the program from blowing out on extremely large inputs. It may be possible to tweak git-pack-objects to get what you propose above, but to be honest I think the git-fast-import I just sent was easier, especially as it avoids the temporary loose object stage. -- Shawn. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Creating objects manually and repack 2006-08-05 5:21 ` Shawn Pearce 2006-08-05 5:40 ` Jon Smirl @ 2006-08-05 5:46 ` Shawn Pearce 1 sibling, 0 replies; 36+ messages in thread From: Shawn Pearce @ 2006-08-05 5:46 UTC (permalink / raw) To: Jon Smirl; +Cc: Martin Langhoff, Linus Torvalds, git [-- Attachment #1: Type: text/plain, Size: 2432 bytes --] Shawn Pearce <spearce@spearce.org> wrote: > Jon Smirl <jonsmirl@gmail.com> wrote: > > On 8/5/06, Martin Langhoff <martin.langhoff@gmail.com> wrote: > > >On 8/5/06, Jon Smirl <jonsmirl@gmail.com> wrote: > > >> On 8/4/06, Linus Torvalds <torvalds@osdl.org> wrote: > > >> > and you're basically all done. The above would turn each *,v file into > > >a > > >> > *-<sha>.pack/*-<sha>.idx file pair, so you'd have exactly as many > > >> > pack-files as you have *,v files. > > >> > > >> I'll end up with 110,000 pack files. > > > > > >Then just do it every 100 files, and you'll only have 1,100 pack > > >files, and it'll be fine. > > > > This is something that has to be tuned. If you wait too long > > everything spills out of RAM and you go totally IO bound for days. If > > you do it too often you end up with too many packs and it takes a day > > to repack them. > > > > If I had a way to pipe all of the objects into repack one at a > > time without repack doing multiple passes, none of this tuning would be > > necessary. In this model the standalone objects never get created in > > the first place. The fastest IO is IO that has been eliminated. > > I'm almost done with what I'm calling `git-fast-import`. OK, now I'm done. I'm attaching the code. Toss it into the Makefile as git-fast-import and recompile. I tested it with the following Perl script, feeding the Perl script a list of files that I wanted blobs for on STDIN:

	while (<>) {
		chop;
		print pack('L', -s $_);
		open(F, $_);
		my $buf;
		print $buf while read(F,$buf,128*1024) > 0;
		close F;
	}

This gave me an execution order of:

	find . -name '*.c' | perl test.pl | git-fast-import in.pack
	git-index-pack in.pack

at which point in.pack claims to be a completely valid pack with an index of in.idx. Move these into .git/objects/pack, generate trees and commits, and run git-repack -a -d. If the order you feed the objects to git-fast-import in is reasonable (do one RCS file at a time, feed most recent to least recent revisions) you may not get any major benefit from using -f during your final repack. The code for git-fast-import could probably be tweaked to accept trees and commits too, which would permit you to stream the entire CVS repository into a single pack file. :-) I can't help you decompress the RCS files faster, but hopefully this will help you generate the GIT pack faster. Hopefully you can make use of it! -- Shawn.

[-- Attachment #2: fast-import.c --] [-- Type: text/x-csrc, Size: 4659 bytes --]

#include "builtin.h"
#include "cache.h"
#include "object.h"
#include "blob.h"
#include "delta.h"
#include "pack.h"
#include "csum-file.h"

static int max_depth = 10;
static unsigned long object_count;
static int packfd;
static int current_depth;
static void *lastdat;
static unsigned long lastdatlen;
static unsigned char lastsha1[20];

static ssize_t yread(int fd, void *buffer, size_t length)
{
	ssize_t ret = 0;
	while (ret < length) {
		ssize_t size = xread(fd, (char *) buffer + ret, length - ret);
		if (size < 0) {
			return size;
		}
		if (size == 0) {
			return ret;
		}
		ret += size;
	}
	return ret;
}

static ssize_t ywrite(int fd, void *buffer, size_t length)
{
	ssize_t ret = 0;
	while (ret < length) {
		ssize_t size = xwrite(fd, (char *) buffer + ret, length - ret);
		if (size < 0) {
			return size;
		}
		if (size == 0) {
			return ret;
		}
		ret += size;
	}
	return ret;
}

static unsigned long encode_header(
	enum object_type type,
	unsigned long size,
	unsigned char *hdr)
{
	int n = 1;
	unsigned char c;

	if (type < OBJ_COMMIT || type > OBJ_DELTA)
		die("bad type %d", type);

	c = (type << 4) | (size & 15);
	size >>= 4;
	while (size) {
		*hdr++ = c | 0x80;
		c = size & 0x7f;
		size >>= 7;
		n++;
	}
	*hdr = c;
	return n;
}

static void write_blob (void *dat, unsigned long datlen)
{
	z_stream s;
	void *out, *delta;
	unsigned char hdr[64];
	unsigned long hdrlen, deltalen;

	if (lastdat && current_depth < max_depth) {
		delta = diff_delta(lastdat, lastdatlen,
			dat, datlen,
			&deltalen, 0);
	} else
		delta = 0;

	memset(&s, 0, sizeof(s));
	deflateInit(&s, zlib_compression_level);

	if (delta) {
		current_depth++;
		s.next_in = delta;
		s.avail_in = deltalen;
		hdrlen = encode_header(OBJ_DELTA, deltalen, hdr);
		if (ywrite(packfd, hdr, hdrlen) != hdrlen)
			die("Can't write object header: %s", strerror(errno));
		if (ywrite(packfd, lastsha1, sizeof(lastsha1)) != sizeof(lastsha1))
			die("Can't write object base: %s", strerror(errno));
	} else {
		current_depth = 0;
		s.next_in = dat;
		s.avail_in = datlen;
		hdrlen = encode_header(OBJ_BLOB, datlen, hdr);
		if (ywrite(packfd, hdr, hdrlen) != hdrlen)
			die("Can't write object header: %s", strerror(errno));
	}

	s.avail_out = deflateBound(&s, s.avail_in);
	s.next_out = out = xmalloc(s.avail_out);
	while (deflate(&s, Z_FINISH) == Z_OK)
		/* nothing */;
	deflateEnd(&s);

	if (ywrite(packfd, out, s.total_out) != s.total_out)
		die("Failed writing compressed data %s", strerror(errno));

	free(out);
	if (delta)
		free(delta);
}

static void init_pack_header ()
{
	const char* magic = "PACK";
	unsigned long version = 2;
	unsigned long zero = 0;

	version = htonl(version);

	if (ywrite(packfd, (char*)magic, 4) != 4)
		die("Can't write pack magic: %s", strerror(errno));
	if (ywrite(packfd, &version, 4) != 4)
		die("Can't write pack version: %s", strerror(errno));
	if (ywrite(packfd, &zero, 4) != 4)
		die("Can't write 0 object count: %s", strerror(errno));
}

static void fixup_header_footer ()
{
	SHA_CTX c;
	char hdr[8];
	unsigned char sha1[20];
	unsigned long cnt;
	char *buf;
	size_t n;

	if (lseek(packfd, 0, SEEK_SET) != 0)
		die("Failed seeking to start: %s", strerror(errno));

	SHA1_Init(&c);
	if (yread(packfd, hdr, 8) != 8)
		die("Failed reading header: %s", strerror(errno));
	SHA1_Update(&c, hdr, 8);

	fprintf(stderr, "%lu objects\n", object_count);
	cnt = htonl(object_count);
	SHA1_Update(&c, &cnt, 4);
	if (ywrite(packfd, &cnt, 4) != 4)
		die("Failed writing object count: %s", strerror(errno));

	buf = xmalloc(128 * 1024);
	for (;;) {
		n = xread(packfd, buf, 128 * 1024);
		if (n <= 0)
			break;
		SHA1_Update(&c, buf, n);
	}
	free(buf);

	SHA1_Final(sha1, &c);
	if (ywrite(packfd, sha1, sizeof(sha1)) != sizeof(sha1))
		die("Failed writing pack checksum: %s", strerror(errno));
}

int main (int argc, const char **argv)
{
	packfd = open(argv[1], O_RDWR|O_CREAT|O_TRUNC, 0666);
	if (packfd < 0)
		die("Can't create pack file %s: %s", argv[1], strerror(errno));

	init_pack_header();
	for (;;) {
		unsigned long datlen;
		int hdrlen;
		void *dat;
		char hdr[128];
		unsigned char sha1[20];
		SHA_CTX c;

		if (yread(0, &datlen, 4) != 4)
			break;

		dat = xmalloc(datlen);
		if (yread(0, dat, datlen) != datlen)
			break;

		hdrlen = sprintf(hdr, "blob %lu", datlen) + 1;
		SHA1_Init(&c);
		SHA1_Update(&c, hdr, hdrlen);
		SHA1_Update(&c, dat, datlen);
		SHA1_Final(sha1, &c);

		write_blob(dat, datlen);
		object_count++;
		printf("%s\n", sha1_to_hex(sha1));
		fflush(stdout);

		if (lastdat)
			free(lastdat);
		lastdat = dat;
		lastdatlen = datlen;
		memcpy(lastsha1, sha1, sizeof(sha1));
	}
	fixup_header_footer();
	close(packfd);
	return 0;
}

^ permalink raw reply	[flat|nested] 36+ messages in thread
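A note on the hashing in main() above: the SHA1 printed per blob is the standard git object name, computed over the header "blob <length>" plus a NUL byte, followed by the raw data. That construction can be checked independently of git; a small Python sketch (the empty-blob value asserted in passing is git's well-known constant e69de29bb2d1d6434b8b29ae775ad8c2e48c5391):

```python
import hashlib

def git_blob_sha1(data):
    """Name a blob the way git (and the fast-import code above) does:
    SHA-1 over the header 'blob <len>' + NUL, followed by the raw bytes."""
    h = hashlib.sha1()
    h.update(b"blob %d\x00" % len(data))  # same bytes as sprintf("blob %lu") + the trailing NUL
    h.update(data)
    return h.hexdigest()
```

This is why the C code passes hdrlen = sprintf(...) + 1: the NUL terminator is part of the hashed header.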
end of thread, other threads:[~2006-08-05 5:52 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-08-04  3:43 Creating objects manually and repack Jon Smirl
2006-08-04  3:58 ` Jeff King
2006-08-04  4:01 ` Linus Torvalds
2006-08-04  4:24 ` Jon Smirl
2006-08-04  4:46 ` Linus Torvalds
2006-08-04  5:01 ` Linus Torvalds
2006-08-04  5:11 ` Jon Smirl
2006-08-04 14:40 ` Jon Smirl
2006-08-04 14:50 ` Jon Smirl
2006-08-04 15:22 ` Linus Torvalds
2006-08-04 15:41 ` Jon Smirl
2006-08-04 16:01 ` A Large Angry SCM
2006-08-04 16:11 ` Jon Smirl
2006-08-04 16:32 ` Linus Torvalds
2006-08-04 16:56 ` Linus Torvalds
2006-08-04 16:39 ` Rogan Dawes
2006-08-04 16:53 ` Jon Smirl
2006-08-04 16:53 ` Linus Torvalds
2006-08-04 17:17 ` Jon Smirl
2006-08-04 17:29 ` Linus Torvalds
2006-08-04 18:06 ` Linus Torvalds
2006-08-04 18:24 ` Junio C Hamano
2006-08-04 19:20 ` Linus Torvalds
2006-08-04 19:31 ` Carl Worth
2006-08-04 19:57 ` Junio C Hamano
2006-08-04 20:08 ` Carl Worth
2006-08-04 20:08 ` Carl Worth
2006-08-04 20:12 ` Jakub Narebski
2006-08-04 20:30 ` Junio C Hamano
2006-08-04 20:37 ` Jakub Narebski
2006-08-05  4:15 ` Martin Langhoff
2006-08-05  5:12 ` Jon Smirl
2006-08-05  5:21 ` Shawn Pearce
2006-08-05  5:40 ` Jon Smirl
2006-08-05  5:52 ` Shawn Pearce
2006-08-05  5:46 ` Shawn Pearce
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).