git.vger.kernel.org archive mirror
* kernel.org and GIT tree rebuilding
@ 2005-06-25  4:20 David S. Miller
  2005-06-25  4:40 ` Jeff Garzik
  2005-06-25  5:04 ` kernel.org and GIT tree rebuilding Junio C Hamano
  0 siblings, 2 replies; 39+ messages in thread
From: David S. Miller @ 2005-06-25  4:20 UTC (permalink / raw)
  To: git


To get a clean history to push to Linus, I typically blow
away my trees and make fresh ones into which I stick the
patches I want to merge.

That mostly works fine here on my local systems, but I know this
brings the master.kernel.org mirroring system to its knees.  So what is
the generally condoned way to do stuff like this in a more friendly way?

Should I:

1) Do a git pull from Linus's tree once he takes my changes, then
   ask GIT to prune the tree?  How do I do that and how does it work?

2) Should I use .git/objects/ database symlinking?

   Are there any scripts out there which do this automatically?
   Something as simple to run as "git-pull-script" and it takes
   care of using links when possible on a local filesystem.

It sometimes takes an hour for my tree updates on master.kernel.org
to propagate to rsync.kernel.org so I can ask Linus to pull.
That's crazy.


* Re: kernel.org and GIT tree rebuilding
  2005-06-25  4:20 kernel.org and GIT tree rebuilding David S. Miller
@ 2005-06-25  4:40 ` Jeff Garzik
  2005-06-25  5:23   ` Linus Torvalds
  2005-06-25  5:04 ` kernel.org and GIT tree rebuilding Junio C Hamano
  1 sibling, 1 reply; 39+ messages in thread
From: Jeff Garzik @ 2005-06-25  4:40 UTC (permalink / raw)
  To: David S. Miller; +Cc: git

David S. Miller wrote:
> To get a clean history to push to Linus, I typically blow
> away my trees and make fresh ones into which I stick the
> patches I want to merge.
> 
> That mostly works fine here on my local systems, but I know this
> brings the master.kernel.org mirroring system to its knees.  So what is
> the generally condoned way to do stuff like this in a more friendly way?
> 
> Should I:
> 
> 1) Do a git pull from Linus's tree once he takes my changes, then
>    ask GIT to prune the tree?  How do I do that and how does it work?

> It sometimes takes an hour for my tree updates on master.kernel.org
> to propagate to rsync.kernel.org so I can ask Linus to pull.
> That's crazy.

Unfortunately you cannot fix this by changing your actions.  This is the 
cumulative effect of all the git kernel trees on kernel.org.  It now 
takes over an hour for my non-git changes to propagate from master to 
the mirrors, as well.

This is all due to the rsync sweeps, which have to scan metric tons of 
inodes and dentries.  Orders of magnitude over the pre-git days.

The ftpadmin@kernel.org folks are supposedly working on an inotify-based
system and an improved rsync application.  No ETA or details.

As an aside, cold-cache, git really punishes my disks.  Ted Ts'o noted
that it really drains laptop batteries, too.


> 2) Should I use .git/objects/ database symlinking?
> 
>    Are there any scripts out there which do this automatically?
>    Something as simple to run as "git-pull-script" and it takes
>    care of using links when possible on a local filesystem.

On both kernel.org and locally, I use 'cp -al' to duplicate the initial 
.git/objects directory, and then rsync (->kernel.org) or git-pull-script 
(<-kernel.org) to update it after that.

That definitely helps.

Maybe somebody needs to script a relink cron job for kernel.org?

	Jeff




* Re: kernel.org and GIT tree rebuilding
  2005-06-25  4:20 kernel.org and GIT tree rebuilding David S. Miller
  2005-06-25  4:40 ` Jeff Garzik
@ 2005-06-25  5:04 ` Junio C Hamano
  1 sibling, 0 replies; 39+ messages in thread
From: Junio C Hamano @ 2005-06-25  5:04 UTC (permalink / raw)
  To: David S. Miller; +Cc: git

>>>>> "DSM" == David S Miller <davem@davemloft.net> writes:

DSM> To get a clean history to push to Linus, I typically blow
DSM> away my trees and make fresh ones into which I stick the
DSM> patches I want to merge.

DSM> Should I:

DSM> 1) Do a git pull from Linus's tree once he takes my changes, then
DSM>    ask GIT to prune the tree?  How do I do that and how does it work?

DSM> 2) Should I use .git/objects/ database symlinking?

DSM>    Are there any scripts out there which do this automatically?
DSM>    Something as simple to run as "git-pull-script" and it takes
DSM>    care of using links when possible on a local filesystem.

git-pull-script internally uses git-fetch-script, which knows how
to fetch from a local tree using hardlinks.  Presumably, the
following workflow would work:

 (1) You hack away in your private tree, while you keep a "to be
     published" clean tree, both on your local machine.

 (2) Do a GIT pull and merge in your private tree to come up
     with a clean set of changes there.  This is the
     tree you "typically blow away".  Reordering the commits to
     come up with a clean history since you last pulled from
     Linus would also happen in this tree.

 (3) Once you have a commit that you want to publish (i.e. the
     commit chain between that commit and the point you last
     pulled from Linus is the "clean history to push to Linus"),
     you go to your "to be published" clean tree, and run
     git-fetch-script to fetch the commit you want to publish
     from your private tree.  When you give an absolute path as
     the "remote repo", git-local-pull with linking behaviour is
     used by git-fetch-script; otherwise the rsync backend is
     used and you end up with a polluted object database.  This
     way you copy only the clean stuff from your private tree.
     Your HEAD in this tree should be set to the commit you
     wanted to publish.  Running git-prune would be nicer, but
     if your history is truly clean it should not be necessary.

 (4) Garbage collecting your private tree with git-prune is
     your own business.



* Re: kernel.org and GIT tree rebuilding
  2005-06-25  4:40 ` Jeff Garzik
@ 2005-06-25  5:23   ` Linus Torvalds
  2005-06-25  5:48     ` Jeff Garzik
  0 siblings, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2005-06-25  5:23 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: David S. Miller, git



On Sat, 25 Jun 2005, Jeff Garzik wrote:
> 
> This is all due to the rsync sweeps, which have to scan metric tons of 
> inodes and dentries.  Orders of magnitude over the pre-git days.

Well, the real solution is to use a git-aware protocol, not rsync.

rsync is wonderful for prototyping, and I wanted to make the database 
rsync'able for that reason, but it clearly doesn't scale.

I think I'll make a "pack"/"unpack" pair that just packs all the necessary
objects between two commits. Then you can basically sync the object database
by doing

	git-pack OLD..NEW | ssh other-end git-unpack

and you'd basically be done. It looks pretty easy to do, too..

			Linus


* Re: kernel.org and GIT tree rebuilding
  2005-06-25  5:23   ` Linus Torvalds
@ 2005-06-25  5:48     ` Jeff Garzik
  2005-06-25  6:16       ` Linus Torvalds
  0 siblings, 1 reply; 39+ messages in thread
From: Jeff Garzik @ 2005-06-25  5:48 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: David S. Miller, git

Linus Torvalds wrote:
> 
> On Sat, 25 Jun 2005, Jeff Garzik wrote:
> 
>>This is all due to the rsync sweeps, which have to scan metric tons of 
>>inodes and dentries.  Orders of magnitude over the pre-git days.
> 
> 
> Well, the real solution is to use a git-aware protocol, not rsync.
> 
> rsync is wonderful for prototyping, and I wanted to make the database 
> rsync'able for that reason, but it clearly doesn't scale.
> 
> I think I'll make a "pack"/"unpack" pair that just packs all the necessary
> objects between two commits. Then you can basically sync the object database
> by doing
> 
> 	git-pack OLD..NEW | ssh other-end git-unpack
> 
> and you'd basically be done. It looks pretty easy to do, too..


The problem is kernel.org mirroring, not individual pushes and pulls, 
really.

Would git-pack be the best solution for mirroring a bunch of git trees?

	Jeff




* Re: kernel.org and GIT tree rebuilding
  2005-06-25  5:48     ` Jeff Garzik
@ 2005-06-25  6:16       ` Linus Torvalds
  2005-06-26 16:41         ` Linus Torvalds
  0 siblings, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2005-06-25  6:16 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: David S. Miller, git



On Sat, 25 Jun 2005, Jeff Garzik wrote:
> 
> The problem is kernel.org mirroring, not individual pushes and pulls, 
> really.
> 
> Would git-pack be the best solution for mirroring a bunch of git trees?

No guarantees, but here's a rough plan:

 - I just committed a fairly trivial change to add a "--objects" flag to 
   git-rev-list, which allows you to basically say "I want to see the 
   difference not just in commit ID's, but also trees and blobs"

   What does that mean? It means that in a mirroring scenario, you can,
   for each git tree, do:

	(a) On the slave:
		cat .git/refs/*/* | sort | uniq > slave-ref-list

	(b) On the master:
		cat .git/refs/*/* | sort | uniq > master-ref-list

	(c) On the master:

		cmp master-ref-list slave-ref-list && exit 1
		list=$(cat master-ref-list)
		for i in $(cat slave-ref-list)
		do
			list="$list ^$i"
		done
		git-rev-list --objects $list

   and that "git-rev-list" invocation will list every object that needs to be
   copied from the master to the slave. No need to read huge directories 
   etc, you get the list computed for you.

yeah, it clearly needs some refining to be useful, but I think you can
kind of see how it would work.

Now, the secondary advantage of this is that once you don't use rsync as
the mirroring method, you can now change the filesystem object database
layout. In particular, the packing thing that Chris Mason was working on
at some point suddenly becomes a lot more viable.

(In fact, more than that. You can make a single packed blob for all
"historical" objects, and that also gives you an efficient archive format
- if you're not required to have the full filesystem layout, you could
have a much more efficient packing that you basically do once a week or
something, so that you only keep the last week in the regular "one file
per object" format).

		Linus


* Re: kernel.org and GIT tree rebuilding
  2005-06-25  6:16       ` Linus Torvalds
@ 2005-06-26 16:41         ` Linus Torvalds
  2005-06-26 18:39           ` Junio C Hamano
                             ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Linus Torvalds @ 2005-06-26 16:41 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: David S. Miller, Git Mailing List, Nicolas Pitre, Chris Mason



On Fri, 24 Jun 2005, Linus Torvalds wrote:
> 
> yeah, it clearly needs some refining to be useful, but I think you can
> kind of see how it would work.

Ok, here's how it works.

 - Pick a starting commit (or a hundred)

 - Pick an ending commit (or a hundred)

 - generate the list of objects in between them

	git-rev-list --objects end ^start > object-list

 - Pack that list of objects into an "object pack":

	git-pack-objects out < object-list

   (This actually generates two files: "out.idx" is the index file, 
   "out.pack" is the data file, but I'll make it concatenate the two at 
   some point)

 - move the pack-files over somewhere else

 - unpack them

	git-unpack-objects out

and you're done.

Now, the reason I use "pack" and "unpack" instead of just "tar" to
transport the objects is that this allows me to do a fairly efficient
packing. I wanted these pack-files to be independent (ie they do _not_
depend on any objects outside of the pack-file), but within the objects
described in the pack I can do delta-compression.

Now, that doesn't much help for small updates (where the objects are just
unrelated and have no deltas), but it helps increasingly for big ones, the
biggest obviously being the whole path from the start to the HEAD.

For example, the "du -sh .git/objects" for the git project itself is 17MB 
for me, and I can do:

	torvalds@ppc970:~/git> du -sh .git/objects 
	17M     .git/objects

	torvalds@ppc970:~/git> time git-rev-list --objects HEAD | git-pack-objects out
	Packing 3656 objects

	real    0m3.779s
	user    0m3.169s
	sys     0m0.602s

	torvalds@ppc970:~/git> ls -lh out.*
	-rw-rw-r--  1 torvalds torvalds  87K Jun 26 09:12 out.idx
	-rw-rw-r--  1 torvalds torvalds 2.0M Jun 26 09:12 out.pack

ie it packs down to a nice 2MB pack-file with a small index. Move that
over to somewhere else, and unpack it, and you'll get all the regular
objects (it doesn't move tags and refs over, you'll have to do that
outside of the packing).

Now, you can trade off some packing time to get a better pack:

	torvalds@ppc970:~/git> time git-rev-list --objects HEAD | git-pack-objects --window=100 out
	Packing 3656 objects
	
	real    0m11.953s
	user    0m11.294s
	sys     0m0.663s

	torvalds@ppc970:~/git> ls -lh out.*
	-rw-rw-r--  1 torvalds torvalds  87K Jun 26 09:14 out.idx
	-rw-rw-r--  1 torvalds torvalds 1.6M Jun 26 09:14 out.pack

and if you want to allow deep delta chains (the default delta depth
limiting is 10), you can get even better results:

	torvalds@ppc970:~/git> time git-rev-list --objects HEAD | git-pack-objects --window=100 --depth=100 out
	Packing 3656 objects
	
	real    0m12.374s
	user    0m11.704s
	sys     0m0.659s

	torvalds@ppc970:~/git> ls -lh out.*
	-rw-rw-r--  1 torvalds torvalds  87K Jun 26 09:16 out.idx
	-rw-rw-r--  1 torvalds torvalds 1.3M Jun 26 09:16 out.pack

but then unpacking will be slightly heavier.

Doing the same for the kernel is obviously much more expensive just
because the kernel is so much bigger. A big delta discovery window like
100 takes about fifteen minutes to pack on my machine, but gets the
current kernel archive down to 70MB or so. That's ok for a monthly "pack
all the objects" to keep size requirements down, but you clearly don't
want to do this all the time ;).

Now, perhaps the more interesting part is that I also designed the pack
format so that it should be a good "history" format, not just a way to
move objects from one place to the other. Ie if you worry about diskspace,
you can pack everything up to now into one big pack, and then remove
the original objects.

Don't do that yet, btw - I haven't actually written the code to read stuff
out of packs if we don't find it in the object directory, but the layout
is such that it should be straightforward and pretty efficient (though a
deep delta chain obviously _will_ cause a performance hit).

I actually like this approach better than having delta-objects in the
filesystem. Partly because the pack-file is self-contained, partly because
it also solves the fs blocking issue, yet is still efficient to look up
the results without having hardlinks etc to duplicate objects virtually.  
And when you do the packing by hand as an "archival" mechanism, it also
doesn't have any of the downsides that Chris' packing approach had.

Nico? Chris? Interested in giving it a look? It's kind of a combination of 
your things, generalized and then made to have fast lookup with the index.

Fast lookup doesn't matter for a normal unpack, of course, and if I just
always wanted to unpack all the objects (ie just an object transfer
mechanism) I'd have made the index be a toposort of the objects. But
because I wanted to be able to use it as an archival format, I needed it
to be "random-access" by object name. So the index is in fact a binary
tree (well, sorted array, so the lookup degenerates into a binary search)
with a top-level index splitting up the contents based on the first byte
(the same way the filesystem layout does).
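
For concreteness, here is a sketch of what the lookup side might look
like - the field sizes are illustrative, not a final format: assume a
256-entry fanout table of 4-byte big-endian counts, followed by sorted
24-byte entries (a 4-byte pack offset plus the 20-byte object name):

	#include <string.h>
	#include <arpa/inet.h>

	/* Sketch only: find an object in a mapped index file laid
	 * out as described above.  Returns 0 and fills *offset on
	 * a hit, -1 if the object is not in this pack. */
	static int lookup_in_index(const void *idx_map,
				   const unsigned char *sha1,
				   unsigned long *offset)
	{
		const unsigned int *fanout = idx_map;
		const unsigned char *entries = (const unsigned char *)(fanout + 256);
		unsigned int lo = sha1[0] ? ntohl(fanout[sha1[0] - 1]) : 0;
		unsigned int hi = ntohl(fanout[sha1[0]]);

		/* the fanout narrows us down to one first-byte bucket,
		 * then it's a plain binary search over sorted entries */
		while (lo < hi) {
			unsigned int mi = lo + (hi - lo) / 2;
			const unsigned char *entry = entries + 24 * mi;
			int cmp = memcmp(entry + 4, sha1, 20);

			if (!cmp) {
				*offset = ntohl(*(const unsigned int *)entry);
				return 0;
			}
			if (cmp < 0)
				lo = mi + 1;
			else
				hi = mi;
		}
		return -1;
	}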

		Linus


* Re: kernel.org and GIT tree rebuilding
  2005-06-26 16:41         ` Linus Torvalds
@ 2005-06-26 18:39           ` Junio C Hamano
  2005-06-26 19:19             ` Linus Torvalds
  2005-06-26 20:52           ` kernel.org and GIT tree rebuilding Chris Mason
  2005-06-28 18:06           ` Nicolas Pitre
  2 siblings, 1 reply; 39+ messages in thread
From: Junio C Hamano @ 2005-06-26 18:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David S. Miller, Git Mailing List, Nicolas Pitre, Chris Mason

>>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes:

LT> I actually like this approach better than having delta-objects in the
LT> filesystem. Partly because the pack-file is self-contained, partly because
LT> it also solves the fs blocking issue, yet is still efficient to look up
LT> the results without having hardlinks etc to duplicate objects virtually.  
LT> And when you do the packing by hand as an "archival" mechanism, it also
LT> doesn't have any of the downsides that Chris' packing approach had.

After analyzing what is involved in integrating packed GIT
into read_sha1_file() [*1*], I agree 100% with the above.  I
mean no disrespect to what Nico has done (and I myself have done
some code to work with Nico's deltified objects when I did diffs
and pull fixes), but it would help the code very much if we do
not have to worry about "delta" objects in GIT_OBJECT_DIRECTORY.

My preference is to do things in this order:

 (0) concatenate pack and idx files;

 (1) teach read_sha1_file() to read from packed GIT;

 (2) teach fsck-cache about packed GIT;

 (3) have people with deltified repositories convert them back
     to undeltified (I think git-pack-objects would barf on such
     a repository);

 (4) drop "delta" objects from GIT_OBJECT_DIRECTORY; this means
     that git-deltafy-script and git-mkdelta have to go.

 (5) tell git-*-pull about packed GIT;


[Footnotes]

*1* Here is the analysis I did last night, still assuming that
we would support "delta" objects in GIT_OBJECT_DIRECTORY.  The
"trickier" map_sha1_file() users almost all involve "delta"
objects, and that is why I prefer dropping them.

 - Enhance GIT_ALTERNATE_OBJECT_DIRECTORIES mechanism so that
   its component can be either a directory or a packed file.

 - sha1_file.c::find_sha1_file() has to be enhanced to return
   not just a path (in the current "individual object file"
   case) but also a pointer to a structure that describes a
   packed file in the GIT_ALTERNATE_OBJECT_DIRECTORIES list,
   along with the offset for the entry.

 - The change necessary to sha1_file.c::has_sha1_file() is
   minimal.  find_sha1_file(), updated along the above lines,
   would say whether the thing exists or not anyway, so it can
   keep returning true/false as it currently does pretty easily.

 - sha1_file.c::read_sha1_file() would be the primary piece to
   unpack from the packed representation.

 - sha1_file.c::map_sha1_file() is trickier.  It has a handful
   of callers outside sha1_file.c for valid reasons, so we will
   need to audit the callers and have them fall back on
   read_sha1_file() as appropriate.  Here is the result of my
   first pass:

   - (easy) sha1_delta_base() is used only when an object is
     deltified, and if so, to get to the base object.  We can
     just tell the caller our object is not deltified when it
     resides in a packed file.

   - (easy) sha1_file_size() is used by diffcore to measure the
     expanded blob size.  Although the implementation obviously
     has to be different, it would be trivial to find the size
     if the object resides in a packed file.

   - (easy) pack-objects.c::check_object() uses map_sha1_file() so
     that it can unpack just enough to get the type of the
     object.  We should be able to introduce a new interface
     (say, sha1_file.c::sha1_object_type()) for doing this sort
     of thing; a sketch of its shape follows this list.

   - (harder) mkdelta.c::get_buffer(), object.c::parse_object()
     and delta.c::process_delta() are trickier, because they
     want to treat "delta" as a raw object (otherwise we would
     have just done read_sha1_file() instead of the
     map/unpack_sha1_file pair).

   - (harder) ssh-push.c::serve_object() also wants the raw
     representation, to ship it directly to the other end.
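
For concreteness, the shape of the interface hinted at above might be
something like this (the signature is a guess at this point; the
patches later in this thread end up calling it sha1_object_info()):

	/* Report the type (and size) of an object without inflating
	 * its whole body, whether it lives as an individual file or
	 * inside a packed file.  Returns 0 on success. */
	int sha1_object_info(const unsigned char *sha1,
			     char *type, unsigned long *sizep);

	/* ... so pack-objects.c::check_object() could simply do: */
	char type[20];
	unsigned long size;

	if (sha1_object_info(entry->sha1, type, &size))
		die("unable to get info for %s", sha1_to_hex(entry->sha1));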


* Re: kernel.org and GIT tree rebuilding
  2005-06-26 18:39           ` Junio C Hamano
@ 2005-06-26 19:19             ` Linus Torvalds
  2005-06-26 19:45               ` Junio C Hamano
  0 siblings, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2005-06-26 19:19 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: David S. Miller, Git Mailing List, Nicolas Pitre, Chris Mason



On Sun, 26 Jun 2005, Junio C Hamano wrote:
> 
> My preference is to do things in this order:
> 
>  (0) concatenate pack and idx files;

Actually, I was originally planning to do that, but now that I have 
thought about what read_sha1_file() would actually do, I think it's more 
efficient to leave the index as a separate file.

In particular, what you'd normally do is that if you can't look up the
file in the regular object directory, you start going through the pack
files. You can do it by having GIT_ALTERNATE_OBJECT_DIRECTORIES point to a
pack file, but I actually would prefer the notion of just adding a

	.git/objects/pack

subdirectory, and having object lookup just automatically open and map all 
index files in that subdirectory.

And the thing is, you really just want to map the index files; the data
files can be so big that you can't afford to map them (ie a really big
project might have several pack-files a gig each or something like that).

And the most efficient way to map just the index file is to keep it 
separate, because then the "stat()" will just get the information 
directly, and you then just mmap that. 

The alternative is to first read the index of the index (to figure out how
big the index is), and then map the rest. But that just seems a lot
messier than just mapping the index file directly.

And when creating these things, we do need to create the data file (which 
can be big enough that it doesn't fit in memory) first, so we have to have 
a separate file for it; we can't just stream it out to stdout.
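
A sketch of that startup scan, with register_pack_index() standing in
for whatever bookkeeping structure we end up with:

	#include <dirent.h>
	#include <fcntl.h>
	#include <limits.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <sys/stat.h>
	#include <unistd.h>

	/* Map every *.idx file under .git/objects/pack.  Only the
	 * small index files are mapped eagerly; the big .pack data
	 * files would be opened lazily on first use. */
	static void map_pack_indexes(const char *pack_dir)
	{
		DIR *dir = opendir(pack_dir);
		struct dirent *de;

		if (!dir)
			return;	/* no packs yet - that's fine */
		while ((de = readdir(dir)) != NULL) {
			char path[PATH_MAX];
			struct stat st;
			size_t len = strlen(de->d_name);
			int fd;

			if (len < 4 || strcmp(de->d_name + len - 4, ".idx"))
				continue;
			snprintf(path, sizeof(path), "%s/%s", pack_dir, de->d_name);
			fd = open(path, O_RDONLY);
			if (fd < 0)
				continue;
			/* stat() gives the index size directly, so the
			 * whole file can be mapped in one go */
			if (!fstat(fd, &st)) {
				void *map = mmap(NULL, st.st_size, PROT_READ,
						 MAP_PRIVATE, fd, 0);
				if (map != MAP_FAILED)
					register_pack_index(path, map, st.st_size);
			}
			close(fd);
		}
		closedir(dir);
	}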

Now, when _sending_ the pack-files, linearizing them is easy: you just 
send the index first, and the data file immediately afterwards. The index 
tells how big it is, so there's no need to even add any markers: you can 
do something like 'git-send-script' with something simple like

	git-rev-list ... | git-pack-file tmp-pack &&
	cat tmp-pack.idx tmp-pack.data | ssh other git-receive-script

So let's just keep the index/data files separate.

		Linus


* Re: kernel.org and GIT tree rebuilding
  2005-06-26 19:19             ` Linus Torvalds
@ 2005-06-26 19:45               ` Junio C Hamano
       [not found]                 ` <7v1x6om6o5.fsf@assigned-by-dhcp.cox.net>
  0 siblings, 1 reply; 39+ messages in thread
From: Junio C Hamano @ 2005-06-26 19:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David S. Miller, Git Mailing List, Nicolas Pitre, Chris Mason

>>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes:

LT> On Sun, 26 Jun 2005, Junio C Hamano wrote:
>> 
>> My preference is to do things in this order:
>> 
>> (0) concatenate pack and idx files;

LT> So let's just keep the index/data files separate.

Fair enough.  Having thought about it a bit more, if people
agree, I think it would make more sense to rip out "delta"
object support first, before doing read_sha1_file() and friends
that use .git/objects/pack.

My "preferred order" now look like this:

 (1) have people with deltified repositories convert them back
     to undeltified (I think git-pack-objects would barf on such
     a repository);

 (2) drop "delta" objects from GIT_OBJECT_DIRECTORY; this means
     that git-deltafy-script and git-mkdelta have to go.

 (3) teach read_sha1_file() to read from packed GIT in
     .git/objects/pack;

 (4) teach fsck-cache about packed GIT;

 (5) tell git-*-pull about packed GIT;


* Re: kernel.org and GIT tree rebuilding
  2005-06-26 16:41         ` Linus Torvalds
  2005-06-26 18:39           ` Junio C Hamano
@ 2005-06-26 20:52           ` Chris Mason
  2005-06-26 21:03             ` Chris Mason
  2005-06-26 21:40             ` Linus Torvalds
  2005-06-28 18:06           ` Nicolas Pitre
  2 siblings, 2 replies; 39+ messages in thread
From: Chris Mason @ 2005-06-26 20:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeff Garzik, David S. Miller, Git Mailing List, Nicolas Pitre

On Sunday 26 June 2005 12:41, Linus Torvalds wrote:
> On Fri, 24 Jun 2005, Linus Torvalds wrote:
> > yeah, it clearly needs some refining to be useful, but I think you can
> > kind of see how it would work.
>
> Ok, here's how it works.
>
>  - Pick a starting commit (or a hundred)
>
>  - Pick an ending commit (or a hundred)
>
>  - generate the list of objects in between them
>
> 	git-rev-list --objects end ^start > object-list
>
>  - Pack that list of objects into an "object pack":
>
> 	git-pack-objects out < object-list

Without having read the code, the big thing that hurt performance in my early
packed file work was compressing the whole packed file instead of individual
sub-objects.  It takes more room to compress each object individually, but
when I compressed the whole thing read performance was quite bad.

-chris


* Re: kernel.org and GIT tree rebuilding
  2005-06-26 20:52           ` kernel.org and GIT tree rebuilding Chris Mason
@ 2005-06-26 21:03             ` Chris Mason
  2005-06-26 21:40             ` Linus Torvalds
  1 sibling, 0 replies; 39+ messages in thread
From: Chris Mason @ 2005-06-26 21:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeff Garzik, David S. Miller, Git Mailing List, Nicolas Pitre

On Sunday 26 June 2005 16:52, Chris Mason wrote:
> >
> > 	git-rev-list --objects end ^start > object-list
> >
> >  - Pack that list of objects into an "object pack":
> >
> > 	git-pack-objects out < object-list
>
> Without having read the code, the big thing that hurt performance in my
> early packed file work was compressing the whole packed file instead of
> individual sub-objects.  It takes more room to compress each object
> individually, but when I compressed the whole thing read performance was
> quite bad.

Sorry, fat-fingered the send key...

The hard links were the biggest problem with my packed file patches; I think
the dynamic lookup in a separate packed file index is the best way to go.

-chris


* Re: kernel.org and GIT tree rebuilding
  2005-06-26 20:52           ` kernel.org and GIT tree rebuilding Chris Mason
  2005-06-26 21:03             ` Chris Mason
@ 2005-06-26 21:40             ` Linus Torvalds
  2005-06-26 22:34               ` Linus Torvalds
  1 sibling, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2005-06-26 21:40 UTC (permalink / raw)
  To: Chris Mason; +Cc: Jeff Garzik, David S. Miller, Git Mailing List, Nicolas Pitre



On Sun, 26 Jun 2005, Chris Mason wrote:
> 
> Without having read the code, the big thing that hurt performance in my early
> packed file work was compressing the whole packed file instead of individual
> sub-objects.  It takes more room to compress each object individually, but
> when I compressed the whole thing read performance was quite bad.

Since I wanted random-access, compressing the whole thing just wasn't an 
option.

Besides, the big space savings come from finding deltas, which is 
obviously also a compression, just at a higher level. The biggest problem 
there is guessing which objects to try to delta against, and right now
that part is pretty stupid and could possibly be improved (it just sorts 
objects by size and tries to delta against "close" objects).

To generate a better sort _would_ actually be pretty close to doing a 
global compression (it really does boil down to the same thing: finding 
big sub-sequences, except it's in a "fragmented" space), but one issue is 
that I don't want to read in the whole data set in one go, so it would 
have to be based on some rolling hash or something. Davide pointed to 
rzip, and a variation of that (which knows about object boundaries) might 
work.

(You can also sort by filename, if you want to try. I don't track
filenames at all there and it's actually non-trivial to do, so that would
require some new and pretty nasty code, but it's possible in _theory_ at
least.)

Anyway, that's all potential improvement for generating better packing,
and it should certainly be possible without changing the format - just
generate a better initial sort, in order to find more deltas (or rather,
find them faster by using a smaller window size).

So the stupid sort I have now does actually work, but exactly because it's
so stupid it wants a big window for best packing (because there might be a
lot of objects that aren't interesting), which in turn is quite expensive.
So a better sort would make a smaller window more effective.

[ Some numbers: a window of 10 objects is the default, and packs the
  current kernel down to 77MB in 2m21s. A window of 20 objects improves
  that packing to 71MB, but makes the packing time go up to 3m36s for me.  
  And a window of 100 gets us down to 62M but takes 11m54s.

  A window of 200 (with a delta depth of 200 too - likely _way_ too deep
  for normal use) gives you a 59M pack, but takes 20m59s, so there's
  definitely a point of diminishing returns.

  This is all for the current HEAD, which takes up 264M the "traditional" 
  git way and takes 141M without any deltas, just packed tightly with no 
  filesystem blocking.

  Now, as you may notice, that's actually a slightly sub-linear increase in
  time, because as we find a delta, we will only accept smaller deltas in
  the future, so we can often stop comparing even before we've reached the
  maximum window size, and so effort is slightly less than linear because 
  there's effectively a constant component to part of it.

  Also, the good news is that you probably don't want to generate one 
  humungous pack archive anyway, but you're likely better off doing a new 
  incremental pack every few months. So we'll never have the situation 
  that creating a pack gets increasingly more costly, since at some point 
  you just say "ok, I created a perfect pack for the first 4 months of
  development, I'll now do subsequent packs on top of that instead".

  The other good news is that a pack is also a natural boundary for fsck 
  (as in "ok, I found that object in a pack, so I won't bother going
  deeper in the reachability chain"), so if you start packing your
  repository, fsck will only have to worry about the objects that are
  unpacked. That makes them work really naturally for archiving, ie this 
  all means that you can avoid a lot of overhead by packing your history
  every once in a while, with it all being entirely transparent.

  In other words, if you just pack every month, you can basically 
  guarantee that fsck costs etc never really go up, and your diskspace 
  also goes up only very slowly. The packed format is quite efficient in 
  many ways, but it is totally immutable (ie you can't add anything to an 
  archive - a pack stays the way it always was, and if you want to pack 
  more you have to either re-do the pack or just create a new one) ]

		Linus


* Re: kernel.org and GIT tree rebuilding
  2005-06-26 21:40             ` Linus Torvalds
@ 2005-06-26 22:34               ` Linus Torvalds
  0 siblings, 0 replies; 39+ messages in thread
From: Linus Torvalds @ 2005-06-26 22:34 UTC (permalink / raw)
  To: Chris Mason; +Cc: Jeff Garzik, David S. Miller, Git Mailing List, Nicolas Pitre



On Sun, 26 Jun 2005, Linus Torvalds wrote:
> 
> (You can also sort by filename, if you want to try. I don't track
> filenames at all there and it's actually non-trivial to do, so that would
> require some new and pretty nasty code, but it's possible in _theory_ at
> least.)

Heh. It's actually very easy if you don't take the "name" too seriously, 
and you just pick some random one, namely the first one that was used to 
reach the entry.

Then, you might sort the objects on a hash based on the name, and get 
tons of cheap deltas close-by.

It is _uglee_, but hey, it's a heuristic, and it happens to work pretty 
well. It brought the kernel pack down to 59M even with just a small window 
of 10.
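
The hash itself can be as dumb as it gets - something in this spirit
(illustrative, not the exact function):

	/* All this needs to do is make objects that were reached
	 * through the same (or a similar) path sort near each other,
	 * so any cheap string hash over the name will do. */
	static unsigned int name_hash(const char *name)
	{
		unsigned int hash = 0;

		while (*name)
			hash = hash * 11 + *name++;
		return hash;
	}

Sort by (hash, size) and successive revisions of the same file end up
adjacent, right where a small window can see them.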

Thanks to Davide for making me think about this hack, although he talked 
about something much more proper (and harder) than this quick and 
ugly heuristic ;)

The main change is that "git-rev-list --objects" has been changed to show
the name the object was reached through.

			Linus


* [PATCH] Obtain sha1_file_info() for deltified pack entry properly.
       [not found]                     ` <7v64vzyqyw.fsf_-_@assigned-by-dhcp.cox.net>
@ 2005-06-28  6:56                       ` Junio C Hamano
  2005-06-28  6:58                         ` Junio C Hamano
                                           ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Junio C Hamano @ 2005-06-28  6:56 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

I will be sending these three patches:

    [PATCH 1/3] Obtain sha1_file_info() for deltified pack entry properly.
    [PATCH 2/3] git-cat-file: use sha1_object_info() on '-t'.
    [PATCH 3/3] git-cat-file: '-s' to find out object size.

The first one is slightly different from what I sent earlier to
you privately.  If you have already applied it, please apply the
4-liner alternate patch attached to this message on top of it;
it carries the fix that is already included in the one in this
series (and drop the first patch of the series, obviously).

The second and third patches fell out as a bonus while I was
debugging the sha1_file_info().  Especially the third one is in
the "because we can do it so cheaply now" category, not the
"because I need to have that feature" category, and I do not mind
too much if you drop it, but I suspect somebody may find it useful.

The "4-liner alternate patch" follows.

------------
Add missing use_packed_git() call.

The function sha1_object_info() was using the packed GIT file
without making sure it was mapped, which resulted in a
segfault.

We would need to introduce unuse_packed_git() call and do proper
use counting to figure out when it is safe to unmap, but
currently we do not unmap packed file yet.

Signed-off-by: Junio C Hamano <junkio@cox.net>
---

 sha1_file.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -675,6 +675,10 @@ static int packed_object_info(struct pac
 	offset = entry->offset;
 	if (p->pack_size - 5 < offset)
 		die("object offset outside of pack file");
+
+	if (use_packed_git(p))
+		die("cannot map packed file");
+
 	pack = p->pack_base + offset;
 	size = (pack[1] << 24) + (pack[2] << 16) + (pack[3] << 8) + pack[4];
 	left = p->pack_size - offset - 5;
------------------------------------------------


* Re: [PATCH] Obtain sha1_file_info() for deltified pack entry properly.
  2005-06-28  6:56                       ` [PATCH] Obtain sha1_file_info() for deltified pack entry properly Junio C Hamano
@ 2005-06-28  6:58                         ` Junio C Hamano
  2005-06-28  6:58                         ` [PATCH 2/3] git-cat-file: use sha1_object_info() on '-t' Junio C Hamano
  2005-06-28  6:59                         ` [PATCH 3/3] git-cat-file: '-s' to find out object size Junio C Hamano
  2 siblings, 0 replies; 39+ messages in thread
From: Junio C Hamano @ 2005-06-28  6:58 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

[PATCH 1/3] Obtain sha1_file_info() for deltified pack entry properly.

The initial one was not doing enough to figure things out
without uncompressing too much.  It also fixes a potential
segfault resulting from missing use_packed_git() call.

We would need to introduce unuse_packed_git() call and do proper
use counting to figure out when it is safe to unmap, but
currently we do not unmap packed file yet.

Signed-off-by: Junio C Hamano <junkio@cox.net>
---

 sha1_file.c |   73 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 files changed, 69 insertions(+), 4 deletions(-)

d2f58b4aef500835489f30ac5df7985bc21e3c24
diff --git a/sha1_file.c b/sha1_file.c
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -601,9 +601,70 @@ void * unpack_sha1_file(void *map, unsig
 	return unpack_sha1_rest(&stream, hdr, *size);
 }
 
-/* Returns 0 on fast-path success, returns 1 on deltified
- * and need to unpack to see info.
- */
+static int packed_delta_info(unsigned char *base_sha1,
+			     unsigned long delta_size,
+			     unsigned long left,
+			     char *type,
+			     unsigned long *sizep)
+{
+	unsigned char *data;
+	unsigned char delta_head[64];
+	int i;
+	unsigned char cmd;
+	unsigned long data_size, result_size, base_size, verify_base_size;
+	z_stream stream;
+	int st;
+
+	if (left < 20)
+		die("truncated pack file");
+	if (sha1_object_info(base_sha1, type, &base_size))
+		die("cannot get info for delta-pack base");
+
+	data = base_sha1 + 20;
+	data_size = left - 20;
+
+	memset(&stream, 0, sizeof(stream));
+
+	stream.next_in = data;
+	stream.avail_in = data_size;
+	stream.next_out = delta_head;
+	stream.avail_out = sizeof(delta_head);
+
+	inflateInit(&stream);
+	st = inflate(&stream, Z_FINISH);
+	inflateEnd(&stream);
+	if ((st != Z_STREAM_END) && stream.total_out != sizeof(delta_head))
+		die("delta data unpack-initial failed");
+
+	/* Examine the initial part of the delta to figure out
+	 * the result size.  Verify the base size while we are at it.
+	 */
+	data = delta_head;
+	verify_base_size = i = 0;
+	cmd = *data++;
+	while (cmd) {
+		if (cmd & 1)
+			verify_base_size |= *data++ << i;
+		i += 8;
+		cmd >>= 1;
+	}
+
+	/* Read the result size */
+	result_size = i = 0;
+	cmd = *data++;
+	while (cmd) {
+		if (cmd & 1)
+			result_size |= *data++ << i;
+		i += 8;
+		cmd >>= 1;
+	}
+	if (verify_base_size != base_size)
+		die("delta base size mismatch");
+
+	*sizep = result_size;
+	return 0;
+}
+
 static int packed_object_info(struct pack_entry *entry,
 			      char *type, unsigned long *sizep)
 {
@@ -614,12 +675,16 @@ static int packed_object_info(struct pac
 	offset = entry->offset;
 	if (p->pack_size - 5 < offset)
 		die("object offset outside of pack file");
+
+	if (use_packed_git(p))
+		die("cannot map packed file");
+
 	pack = p->pack_base + offset;
 	size = (pack[1] << 24) + (pack[2] << 16) + (pack[3] << 8) + pack[4];
 	left = p->pack_size - offset - 5;
 	switch (*pack) {
 	case 'D':
-		return 1;
+		return packed_delta_info(pack+5, size, left, type, sizep);
 		break;
 	case 'C':
 		strcpy(type, "commit");
------------


* [PATCH 2/3] git-cat-file: use sha1_object_info() on '-t'.
  2005-06-28  6:56                       ` [PATCH] Obtain sha1_file_info() for deltified pack entry properly Junio C Hamano
  2005-06-28  6:58                         ` Junio C Hamano
@ 2005-06-28  6:58                         ` Junio C Hamano
  2005-06-28  6:59                         ` [PATCH 3/3] git-cat-file: '-s' to find out object size Junio C Hamano
  2 siblings, 0 replies; 39+ messages in thread
From: Junio C Hamano @ 2005-06-28  6:58 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

When trying to find out the type of the object, there is no need
to uncompress the whole object.  Just use sha1_object_info().

Signed-off-by: Junio C Hamano <junkio@cox.net>
---

 cat-file.c |   10 ++++------
 1 files changed, 4 insertions(+), 6 deletions(-)

fcfdd3d74d359af32dd30aa9f4fe373cebb3e232
diff --git a/cat-file.c b/cat-file.c
--- a/cat-file.c
+++ b/cat-file.c
@@ -16,13 +16,11 @@ int main(int argc, char **argv)
 		usage("git-cat-file [-t | tagname] <sha1>");
 
 	if (!strcmp("-t", argv[1])) {
-		buf = read_sha1_file(sha1, type, &size);
-		if (buf) {
-			buf = type;
-			size = strlen(type);
-			type[size] = '\n';
-			size++;
+		if (!sha1_object_info(sha1, type, &size)) {
+			printf("%s\n", type);
+			return 0;
 		}
+		buf = NULL;
 	} else {
 		buf = read_object_with_reference(sha1, argv[1], &size, NULL);
 	}
------------


* [PATCH 3/3] git-cat-file: '-s' to find out object size.
  2005-06-28  6:56                       ` [PATCH] Obtain sha1_file_info() for deltified pack entry properly Junio C Hamano
  2005-06-28  6:58                         ` Junio C Hamano
  2005-06-28  6:58                         ` [PATCH 2/3] git-cat-file: use sha1_object_info() on '-t' Junio C Hamano
@ 2005-06-28  6:59                         ` Junio C Hamano
  2 siblings, 0 replies; 39+ messages in thread
From: Junio C Hamano @ 2005-06-28  6:59 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

We use sha1_object_info() now, and getting size is also trivial.

I admit that this is more of a "because we can" than a "because
I see an immediate need for it" change, though.

Signed-off-by: Junio C Hamano <junkio@cox.net>
---

 Documentation/git-cat-file.txt |   12 +++++++++---
 cat-file.c                     |   13 ++++++++++---
 2 files changed, 19 insertions(+), 6 deletions(-)

dcf2fc7609509985d4411fb13c6956bcf3c8560f
diff --git a/Documentation/git-cat-file.txt b/Documentation/git-cat-file.txt
--- a/Documentation/git-cat-file.txt
+++ b/Documentation/git-cat-file.txt
@@ -9,12 +9,13 @@ git-cat-file - Provide content or type i
 
 SYNOPSIS
 --------
-'git-cat-file' (-t | <type>) <object>
+'git-cat-file' (-t | -s | <type>) <object>
 
 DESCRIPTION
 -----------
 Provides content or type of objects in the repository. The type
-is required if '-t' is not being used to find the object type.
+is required unless '-t' is used to find the object type,
+or '-s' is used to find the object size.
 
 OPTIONS
 -------
@@ -25,6 +26,10 @@ OPTIONS
 	Instead of the content, show the object type identified by
 	<object>.
 
+-s::
+	Instead of the content, show the object size identified by
+	<object>.
+
 <type>::
 	Typically this matches the real type of <object> but asking
 	for a type that can trivially dereferenced from the given
@@ -35,7 +40,8 @@ OPTIONS
 
 OUTPUT
 ------
-If '-t' is specified, one of the <type>.
+If '-t' is specified, one of the <type>.  If '-s' is specified,
+the size of the <object> in bytes.
 
 Otherwise the raw (though uncompressed) contents of the <object> will
 be returned.
diff --git a/cat-file.c b/cat-file.c
--- a/cat-file.c
+++ b/cat-file.c
@@ -13,11 +13,18 @@ int main(int argc, char **argv)
 	unsigned long size;
 
 	if (argc != 3 || get_sha1(argv[2], sha1))
-		usage("git-cat-file [-t | tagname] <sha1>");
+		usage("git-cat-file [-t | -s | tagname] <sha1>");
 
-	if (!strcmp("-t", argv[1])) {
+	if (!strcmp("-t", argv[1]) || !strcmp("-s", argv[1])) {
 		if (!sha1_object_info(sha1, type, &size)) {
-			printf("%s\n", type);
+			switch (argv[1][1]) {
+			case 't':
+				printf("%s\n", type);
+				break;
+			case 's':
+				printf("%lu\n", size);
+				break;
+			}
 			return 0;
 		}
 		buf = NULL;
------------


* Re: kernel.org and GIT tree rebuilding
  2005-06-26 16:41         ` Linus Torvalds
  2005-06-26 18:39           ` Junio C Hamano
  2005-06-26 20:52           ` kernel.org and GIT tree rebuilding Chris Mason
@ 2005-06-28 18:06           ` Nicolas Pitre
  2005-06-28 19:28             ` Linus Torvalds
  2 siblings, 1 reply; 39+ messages in thread
From: Nicolas Pitre @ 2005-06-28 18:06 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List, Chris Mason

On Sun, 26 Jun 2005, Linus Torvalds wrote:

> On Fri, 24 Jun 2005, Linus Torvalds wrote:
> > 
> > yeah, it clearly needs some refining to be useful, but I think you can
> > kind of see how it would work.
> 
> Ok, here's how it works.
> 
[...]
> 
> Nico? Chris? Interested in giving it a look? It's kind of a combination of 
> your things, generalized and then made to have fast lookup with the index.

Wow !

Look away for a few days and when you look back your own stuff has been 
yanked out!

But actually I like this packing thing.  Certainly much nicer to work
with and also more useful.  And, since the pack file has its own
checksum, it is possible to use that pack format (without the index,
which can be recreated from the pack file alone anyway) for network
transfer.

Here's one improvement to the pack format, breaking it early so it won't
affect anyone at this point: a compressed object header.  Instead of the
fixed 5-byte header, this patch converts it to a variable size, given that
most objects are small enough to save on storing the significant size
bytes which will be zero, packing the non-zero byte positions in with
the object type.

This can save up to 2% or 21KB on the test I 
performed with the git repository.

diff --git a/pack-objects.c b/pack-objects.c
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -7,7 +7,7 @@
 static const char pack_usage[] = "git-pack-objects [--window=N] [--depth=N] base-name < object-list";
 
 enum object_type {
-	OBJ_NONE,
+	OBJ_NONE = 0,
 	OBJ_COMMIT,
 	OBJ_TREE,
 	OBJ_BLOB,
@@ -55,7 +55,7 @@ static unsigned long write_object(struct
 	char type[10];
 	void *buf = read_sha1_file(entry->sha1, type, &size);
 	char header[25];
-	unsigned hdrlen, datalen;
+	unsigned hdrlen, datalen, bit;
 
 	if (!buf)
 		die("unable to read %s", sha1_to_hex(entry->sha1));
@@ -63,21 +63,33 @@ static unsigned long write_object(struct
 		die("object %s size inconsistency (%lu vs %lu)", sha1_to_hex(entry->sha1), size, entry->size);
 
 	/*
-	 * The object header is a byte of 'type' followed by four bytes of
-	 * length, except for deltas that has the 20 bytes of delta sha
-	 * instead.
+	 * The object header is one byte which low 4 bits represent the
+	 * object type, the 4 upper bits indicates which of the following
+	 * bytes are used to build the object size.  For delta objects the
+	 * sha1 of the reference object is also appended.
 	 */
-	header[0] = ".CTB"[entry->type];
-	hdrlen = 5;
+	header[0] = entry->type;
 	if (entry->delta) {
-		header[0] = 'D';
-		memcpy(header+5, entry->delta, 20);
+		header[0] = OBJ_DELTA;
 		buf = delta_against(buf, size, entry);
 		size = entry->delta_size;
-		hdrlen = 25;
 	}
-	datalen = htonl(size);
-	memcpy(header+1, &datalen, 4);
+	hdrlen = 1;
+	bit = (1 << 4);
+	datalen = size;
+	do {
+		if (datalen & 0xff) {
+			header[0] |= bit;
+			header[hdrlen++] = datalen;
+		}
+		bit <<= 1;
+		datalen >>= 8;
+	} while (datalen);
+	if (entry->delta) {
+		memcpy(header+hdrlen, entry->delta, 20);
+		hdrlen += 20;
+	}
+
 	sha1write(f, header, hdrlen);
 	datalen = sha1write_compressed(f, buf, size);
 	free(buf);
diff --git a/unpack-objects.c b/unpack-objects.c
--- a/unpack-objects.c
+++ b/unpack-objects.c
@@ -12,6 +12,16 @@ struct pack_entry {
 	unsigned char sha1[20];
 };
 
+enum object_type {
+	OBJ_NONE = 0,
+	OBJ_COMMIT,
+	OBJ_TREE,
+	OBJ_BLOB,
+	OBJ_DELTA
+};
+
+static char *type_s[] = { NULL, "commit", "tree", "blob", "delta" };
+
 static void *pack_base;
 static unsigned long pack_size;
 static void *index_base;
@@ -92,7 +102,7 @@ static int check_index(void)
 }
 
 static int unpack_non_delta_entry(struct pack_entry *entry,
-				  int kind,
+				  char *type,
 				  unsigned char *data,
 				  unsigned long size,
 				  unsigned long left)
@@ -101,9 +111,8 @@ static int unpack_non_delta_entry(struct
 	z_stream stream;
 	char *buffer;
 	unsigned char sha1[20];
-	char *type_s;
 
-	printf("%s %c %lu\n", sha1_to_hex(entry->sha1), kind, size);
+	printf("%s %s %lu\n", sha1_to_hex(entry->sha1), type, size);
 	if (dry_run)
 		return 0;
 
@@ -120,18 +129,12 @@ static int unpack_non_delta_entry(struct
 	inflateEnd(&stream);
 	if ((st != Z_STREAM_END) || stream.total_out != size)
 		goto err_finish;
-	switch (kind) {
-	case 'C': type_s = "commit"; break;
-	case 'T': type_s = "tree"; break;
-	case 'B': type_s = "blob"; break;
-	default: goto err_finish;
-	}
-	if (write_sha1_file(buffer, size, type_s, sha1) < 0)
+	if (write_sha1_file(buffer, size, type, sha1) < 0)
 		die("failed to write %s (%s)",
-		    sha1_to_hex(entry->sha1), type_s);
-	printf("%s %s\n", sha1_to_hex(sha1), type_s);
+		    sha1_to_hex(entry->sha1), type);
+	printf("%s %s\n", sha1_to_hex(sha1), type);
 	if (memcmp(sha1, entry->sha1, 20))
-		die("resulting %s have wrong SHA1", type_s);
+		die("resulting %s have wrong SHA1", type);
 
  finish:
 	st = 0;
@@ -183,15 +186,13 @@ static int unpack_delta_entry(struct pac
 		die("truncated pack file");
 	data = base_sha1 + 20;
 	data_size = left - 20;
-	printf("%s D %lu", sha1_to_hex(entry->sha1), delta_size);
+	printf("%s delta %lu", sha1_to_hex(entry->sha1), delta_size);
 	printf(" %s\n", sha1_to_hex(base_sha1));
 
 	if (dry_run)
 		return 0;
 
-	/* pack+5 is the base sha1, unless we have it, we need to
-	 * unpack it first.
-	 */
+	/* unless we have the base sha1, we need to unpack it first. */
 	if (!has_sha1_file(base_sha1)) {
 		struct pack_entry *base;
 		if (!find_pack_entry(base_sha1, &base))
@@ -236,7 +237,9 @@ static int unpack_delta_entry(struct pac
 static void unpack_entry(struct pack_entry *entry)
 {
 	unsigned long offset, size, left;
-	unsigned char *pack;
+	unsigned char *pack, sizebits;
+	enum object_type type;
+	int i;
 
 	/* Have we done this one already due to deltas based on it? */
 	if (lookup_object(entry->sha1))
@@ -246,14 +249,25 @@ static void unpack_entry(struct pack_ent
 	if (offset > pack_size - 5)
 		die("object offset outside of pack file");
 	pack = pack_base + offset;
-	size = (pack[1] << 24) + (pack[2] << 16) + (pack[3] << 8) + pack[4];
-	left = pack_size - offset - 5;
-	switch (*pack) {
-	case 'C': case 'T': case 'B':
-		unpack_non_delta_entry(entry, *pack, pack+5, size, left);
+	sizebits = *pack++;
+	type = sizebits & 0x0f;
+	sizebits >>= 4;
+	i = size = 0;
+	while (sizebits) {
+		if (sizebits & 1)
+			size |= *pack++ << i;
+		i += 8;
+		sizebits >>= 1;
+	}
+	left = pack_size - ((void *)pack - pack_base);
+	switch (type) {
+	case OBJ_COMMIT:
+	case OBJ_TREE:
+	case OBJ_BLOB:
+		unpack_non_delta_entry(entry, type_s[type], pack, size, left);
 		break;
-	case 'D':
-		unpack_delta_entry(entry, pack+5, size, left);
+	case OBJ_DELTA:
+		unpack_delta_entry(entry, pack, size, left);
 		break;
 	default:
 		die("corrupted pack file");


* Re: kernel.org and GIT tree rebuilding
  2005-06-28 18:06           ` Nicolas Pitre
@ 2005-06-28 19:28             ` Linus Torvalds
  2005-06-28 21:08               ` Nicolas Pitre
  0 siblings, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2005-06-28 19:28 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Git Mailing List, Chris Mason



On Tue, 28 Jun 2005, Nicolas Pitre wrote:
> 
> Here's one improvement to the pack format, breaking it early so it won't
> affect anyone at this point: a compressed object header.  Instead of the
> fixed 5-byte header, this patch converts it to a variable size, given that
> most objects are small enough to save on storing the significant size
> bytes which will be zero, packing the non-zero byte positions in with
> the object type.

Ok, this is against an older version and doesn't have that "read_sha1" 
thing, but yes, something like this would work.

I'd prefer the encoding to be a bit different, though: make the size be
encoded in seven bits per byte, with the high bit meaning "more to come". 
We can use four bits from the "type" byte for the initial value, making 
lengths 0-15 free.

	unsigned long size;
	unsigned char c;

	c = *pack++;
	size = c & 15;
	type = (c >> 4) & 7;
	while (c & 0x80) {
		c = *pack++;
		size = (size << 7) + (c & 0x7f);
	}

or something. That's even denser.
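
The write side would just be the mirror image of that loop - a sketch,
emitting the most significant 7-bit group first to match the
(size << 7) accumulation on the read side:

	/* Sketch of the matching encoder: bits 4-6 of the first
	 * byte hold the type, bits 0-3 the most significant four
	 * bits of the size, and bit 7 of every byte means "more
	 * to come". */
	static unsigned char *encode_header(unsigned type, unsigned long size,
					    unsigned char *out)
	{
		unsigned char tmp[16];
		int n = 0;

		while (size >= 16) {	/* won't fit in the 4-bit nibble */
			tmp[n++] = size & 0x7f;
			size >>= 7;
		}
		*out++ = (n ? 0x80 : 0) | ((type & 7) << 4) | size;
		while (n--)		/* emit groups back, MSB group first */
			*out++ = (n ? 0x80 : 0) | tmp[n];
		return out;
	}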

However, I also end up wanting to add a "global header" to the pack-file
that contains at least the number of objects packed. We may not know how
big the pack-file will be, but we'll at least know how many objects it 
has before we start writing it.
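
Such a header could be as small as this (layout purely illustrative -
nothing about it is fixed yet):

	/* A magic signature so the file can be validated, a format
	 * version, and the object count, which is known before the
	 * first byte of data is written.  All fields would be in
	 * network byte order. */
	struct pack_header {
		unsigned int hdr_signature;	/* e.g. "PACK" */
		unsigned int hdr_version;
		unsigned int hdr_entries;	/* number of objects */
	};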

		Linus


* Re: kernel.org and GIT tree rebuilding
  2005-06-28 19:28             ` Linus Torvalds
@ 2005-06-28 21:08               ` Nicolas Pitre
  2005-06-28 21:27                 ` Linus Torvalds
  0 siblings, 1 reply; 39+ messages in thread
From: Nicolas Pitre @ 2005-06-28 21:08 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List, Chris Mason

On Tue, 28 Jun 2005, Linus Torvalds wrote:

> 
> 
> On Tue, 28 Jun 2005, Nicolas Pitre wrote:
> > 
> > Here's one improvement to the pack format, breaking it early so it won't
> > affect anyone at this point: a compressed object header.  Instead of the
> > fixed 5-byte header, this patch converts it to a variable size, given that
> > most objects are small enough to save on storing the significant size
> > bytes which will be zero, packing the non-zero byte positions in with
> > the object type.
> 
> Ok, this is against an older version and doesn't have that "read_sha1" 
> thing, but yes, something like this would work.
> 
> I'd prefer the encoding to be a bit different, though: make the size be
> encoded in seven bits per byte, with the high bit meaning "more to come". 
> We can use four bits from the "type" byte for the initial value, making 
> lengths 0-15 be free.
> 
> 	unsigned long size;
> 	unsigned char c;
> 
> 	c = *pack++;
> 	size = c & 15;
> 	type = (c >> 4) & 7;
> 	while (c & 0x80) {
> 		c = *pack++;
> 		size = (size << 7) + (c & 0x7f);
> 	}
> 
> or something. That's even denser.

OK.  New patch below.

> However, I also end up wanting to add a "global header" to the pack-file, 
> that contains at least the number of objects packed. We may not know how 
> big the pack-file will be, but we'll at least know how many objects it 
> has before we start writing it.

Probably a version signature would be a good thing too.

=====

This patch adds a compressed object header.  Instead of the fixed 5-byte
header, it converts it to a variable size, given that most objects are
small enough to save on storing the size's most significant bytes, which
are zero, even folding bits in with the object type.  Objects up to
2047 bytes in size will have a header of only 2 bytes.

Signed-off-by: Nicolas Pitre <nico@cam.org>

diff --git a/pack-objects.c b/pack-objects.c
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -3,24 +3,17 @@
 #include "object.h"
 #include "delta.h"
 #include "csum-file.h"
+#include "pack.h"
 
 static const char pack_usage[] = "git-pack-objects [--window=N] [--depth=N] {--stdout | base-name} < object-list";
 
-/*
- * The object type is a single-character shorthand:
- *  - 'C' for "Commit"
- *  - 'T' for "Tree"
- *  - 'B' for "Blob"
- *  - 'G' for "taG"
- *  - 'D' for "Delta"
- */
 struct object_entry {
 	unsigned char sha1[20];
 	unsigned long size;
 	unsigned long offset;
 	unsigned int depth;
 	unsigned int hash;
-	unsigned char type;
+	packed_obj_type type;
 	unsigned long delta_size;
 	struct object_entry *delta;
 };
@@ -63,21 +56,30 @@ static unsigned long write_object(struct
 		die("object %s size inconsistency (%lu vs %lu)", sha1_to_hex(entry->sha1), size, entry->size);
 
 	/*
-	 * The object header is a byte of 'type' followed by four bytes of
-	 * length, except for deltas that has the 20 bytes of delta sha
-	 * instead.
+	 * The object header first byte holds the object type in its low
+	 * 3 bits and the low 4 bits of the size above that, with the high
+	 * bit set when more size bytes (7 bits each) follow.  For delta
+	 * objects the sha1 of the reference object is also appended.
 	 */
 	header[0] = entry->type;
-	hdrlen = 5;
 	if (entry->delta) {
-		header[0] = 'D';
-		memcpy(header+5, entry->delta, 20);
+		header[0] = PACKED_DELTA;
 		buf = delta_against(buf, size, entry);
 		size = entry->delta_size;
-		hdrlen = 25;
 	}
-	datalen = htonl(size);
-	memcpy(header+1, &datalen, 4);
+	header[0] |= size << 3;
+	hdrlen = 1;
+	datalen = size >> 4;
+	while (datalen) {
+		header[hdrlen - 1] |= 0x80;
+		header[hdrlen++] = datalen;
+		datalen >>= 7;
+	}
+	if (entry->delta) {
+		memcpy(header+hdrlen, entry->delta, 20);
+		hdrlen += 20;
+	}
+
 	sha1write(f, header, hdrlen);
 	datalen = sha1write_compressed(f, buf, size);
 	free(buf);
@@ -168,13 +170,13 @@ static void check_object(struct object_e
 
 	if (!sha1_object_info(entry->sha1, type, &entry->size)) {
 		if (!strcmp(type, "commit")) {
-			entry->type = 'C';
+			entry->type = PACKED_COMMIT;
 		} else if (!strcmp(type, "tree")) {
-			entry->type = 'T';
+			entry->type = PACKED_TREE;
 		} else if (!strcmp(type, "blob")) {
-			entry->type = 'B';
+			entry->type = PACKED_BLOB;
 		} else if (!strcmp(type, "tag")) {
-			entry->type = 'G';
+			entry->type = PACKED_TAG;
 		} else
 			die("unable to pack object %s of type %s",
 			    sha1_to_hex(entry->sha1), type);
diff --git a/pack.h b/pack.h
new file mode 100644
--- /dev/null
+++ b/pack.h
@@ -0,0 +1,19 @@
+#ifndef PACK_H
+#define PACK_H
+
+/*
+ * The packed object type is stored in the low 3 bits of a byte.
+ * The type value 0 is reserved as a prefix in case there are ever
+ * more than 7 object types, or for any future format extensions.
+ */
+
+typedef enum {
+	PACKED_RESERVED = 0,
+	PACKED_COMMIT = 1,
+	PACKED_TREE = 2,
+	PACKED_BLOB = 3,
+	PACKED_TAG = 4,
+	PACKED_DELTA = 7
+} packed_obj_type;
+
+#endif
diff --git a/unpack-objects.c b/unpack-objects.c
--- a/unpack-objects.c
+++ b/unpack-objects.c
@@ -1,6 +1,7 @@
 #include "cache.h"
 #include "object.h"
 #include "delta.h"
+#include "pack.h"
 
 static int dry_run;
 static int nr_entries;
@@ -12,6 +13,14 @@ struct pack_entry {
 	unsigned char sha1[20];
 };
 
+static char *type_s[] = {
+	[PACKED_COMMIT]	= "commit",
+	[PACKED_TREE]	= "tree",
+	[PACKED_BLOB]	= "blob",
+	[PACKED_TAG]	= "tag",
+	[PACKED_DELTA]	= "delta"
+};
+
 static void *pack_base;
 static unsigned long pack_size;
 static void *index_base;
@@ -92,7 +101,7 @@ static int check_index(void)
 }
 
 static int unpack_non_delta_entry(struct pack_entry *entry,
-				  int kind,
+				  char *type,
 				  unsigned char *data,
 				  unsigned long size,
 				  unsigned long left)
@@ -101,9 +110,8 @@ static int unpack_non_delta_entry(struct
 	z_stream stream;
 	char *buffer;
 	unsigned char sha1[20];
-	char *type_s;
 
-	printf("%s %c %lu\n", sha1_to_hex(entry->sha1), kind, size);
+	printf("%s %s %lu\n", sha1_to_hex(entry->sha1), type, size);
 	if (dry_run)
 		return 0;
 
@@ -120,22 +128,15 @@ static int unpack_non_delta_entry(struct
 	inflateEnd(&stream);
 	if ((st != Z_STREAM_END) || stream.total_out != size)
 		goto err_finish;
-	switch (kind) {
-	case 'C': type_s = "commit"; break;
-	case 'T': type_s = "tree"; break;
-	case 'B': type_s = "blob"; break;
-	case 'G': type_s = "tag"; break;
-	default: goto err_finish;
-	}
-	if (write_sha1_file(buffer, size, type_s, sha1) < 0)
+	if (write_sha1_file(buffer, size, type, sha1) < 0)
 		die("failed to write %s (%s)",
-		    sha1_to_hex(entry->sha1), type_s);
-	printf("%s %s\n", sha1_to_hex(sha1), type_s);
+		    sha1_to_hex(entry->sha1), type);
+	printf("%s %s\n", sha1_to_hex(sha1), type);
 	if (memcmp(sha1, entry->sha1, 20))
-		die("resulting %s have wrong SHA1", type_s);
+		die("resulting %s have wrong SHA1", type);
 
- finish:
 	st = 0;
+ finish:
 	free(buffer);
 	return st;
  err_finish:
@@ -184,15 +185,13 @@ static int unpack_delta_entry(struct pac
 		die("truncated pack file");
 	data = base_sha1 + 20;
 	data_size = left - 20;
-	printf("%s D %lu", sha1_to_hex(entry->sha1), delta_size);
+	printf("%s delta %lu", sha1_to_hex(entry->sha1), delta_size);
 	printf(" %s\n", sha1_to_hex(base_sha1));
 
 	if (dry_run)
 		return 0;
 
-	/* pack+5 is the base sha1, unless we have it, we need to
-	 * unpack it first.
-	 */
+	/* unless we have the base sha1, we need to unpack it first. */
 	if (!has_sha1_file(base_sha1)) {
 		struct pack_entry *base;
 		if (!find_pack_entry(base_sha1, &base))
@@ -237,7 +236,9 @@ static int unpack_delta_entry(struct pac
 static void unpack_entry(struct pack_entry *entry)
 {
 	unsigned long offset, size, left;
-	unsigned char *pack;
+	unsigned char *pack, sizebits;
+	packed_obj_type type;
+	int i;
 
 	/* Have we done this one already due to deltas based on it? */
 	if (lookup_object(entry->sha1))
@@ -247,17 +248,28 @@ static void unpack_entry(struct pack_ent
 	if (offset > pack_size - 5)
 		die("object offset outside of pack file");
 	pack = pack_base + offset;
-	size = (pack[1] << 24) + (pack[2] << 16) + (pack[3] << 8) + pack[4];
-	left = pack_size - offset - 5;
-	switch (*pack) {
-	case 'C': case 'T': case 'B': case 'G':
-		unpack_non_delta_entry(entry, *pack, pack+5, size, left);
+	sizebits = *pack++;
+	type = sizebits & 0x07;
+	size = (sizebits & ~0x80) >> 3;
+	i = 4;
+	while (sizebits & 0x80) {
+		sizebits = *pack++;
+		size |= (sizebits & ~0x80) << i;
+		i += 7;
+	}
+	left = pack_size - ((void *)pack - pack_base);
+	switch (type) {
+	case PACKED_COMMIT:
+	case PACKED_TREE:
+	case PACKED_BLOB:
+	case PACKED_TAG:
+		unpack_non_delta_entry(entry, type_s[type], pack, size, left);
 		break;
-	case 'D':
-		unpack_delta_entry(entry, pack+5, size, left);
+	case PACKED_DELTA:
+		unpack_delta_entry(entry, pack, size, left);
 		break;
 	default:
-		die("corrupted pack file");
+		die("corrupted pack file(unknown object type %d)", type);
 	}
 }
 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: kernel.org and GIT tree rebuilding
  2005-06-28 21:08               ` Nicolas Pitre
@ 2005-06-28 21:27                 ` Linus Torvalds
  2005-06-28 21:55                   ` [PATCH] Bugfix: initialize pack_base to NULL Junio C Hamano
  2005-06-29  3:55                   ` kernel.org and GIT tree rebuilding Nicolas Pitre
  0 siblings, 2 replies; 39+ messages in thread
From: Linus Torvalds @ 2005-06-28 21:27 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Git Mailing List, Chris Mason



On Tue, 28 Jun 2005, Nicolas Pitre wrote:
> 
> OK.  New patch below.

Dammit, I wasted all that time doing it myself.

I just committed and pushed out my version. But mine also does sha1_file.c 
right, so that you can use a packed archive in .git/objects/pack. Yours 
has some other cleanups, so..

Can you double-check my version?  (It hasn't mirrored out yet, it
seems, but it should be there soon.)

> Probably a version signature would be a good thing too.

I did that too, same format as for the index file. Nothing checks it 
though.

		Linus

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH] Bugfix: initialize pack_base to NULL.
  2005-06-28 21:27                 ` Linus Torvalds
@ 2005-06-28 21:55                   ` Junio C Hamano
  2005-06-29  3:55                   ` kernel.org and GIT tree rebuilding Nicolas Pitre
  1 sibling, 0 replies; 39+ messages in thread
From: Junio C Hamano @ 2005-06-28 21:55 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

This was causing random segfaults, because use_packed_git() got
confused by random garbage there.

Signed-off-by: Junio C Hamano <junkio@cox.net>
---

 sha1_file.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

facb119577a28bbb3f2ac1e5f8db37fd2f6d31d8
diff --git a/sha1_file.c b/sha1_file.c
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -395,6 +395,7 @@ static struct packed_git *add_packed_git
 	p->pack_size = st.st_size;
 	p->index_base = idx_map;
 	p->next = NULL;
+	p->pack_base = NULL;
 	p->pack_last_used = 0;
 	return p;
 }
------------

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: kernel.org and GIT tree rebuilding
  2005-06-28 21:27                 ` Linus Torvalds
  2005-06-28 21:55                   ` [PATCH] Bugfix: initialize pack_base to NULL Junio C Hamano
@ 2005-06-29  3:55                   ` Nicolas Pitre
  2005-06-29  5:16                     ` Nicolas Pitre
  1 sibling, 1 reply; 39+ messages in thread
From: Nicolas Pitre @ 2005-06-29  3:55 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List

On Tue, 28 Jun 2005, Linus Torvalds wrote:

> 
> 
> On Tue, 28 Jun 2005, Nicolas Pitre wrote:
> > 
> > OK.  New patch below.
> 
> Dammit, I wasted all that time doing it myself.
> 
> I just committed and pushed out my version. But mine also does sha1_file.c 
> right, so that you can use a packed archive in .git/objects/pack. Yours 
> has some other cleanups, so..
> 
> Can you double-check my version (it hasn't mirrored out yet, it seems, but 
> it should be there soon).

OK... See below the cleanups I merged from my version on top of yours:

 pack-objects.c   |   70 ++++++++++++++-----------------------------------------
 pack.h           |   17 ++++++++-----
 unpack-objects.c |   66 +++++++++++++++++++++++++--------------------------
 3 files changed, 63 insertions(+), 90 deletions(-)

I also restored my original object header size ordering (little endian) 
for two reasons:

 - it is much simpler to generate, which allows quite a bit of code
   to be removed

 - it keeps the bit positions stable, which makes it much easier to
   look at a hex dump of the binary data for manual debugging (see
   the sketch below)

Also a few code optimizations and one error return fix.
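
To make the second point concrete, here is a stand-alone sketch of both
encodings (function names and the test harness are mine; like the
original code, it assumes sizes fit in 32 bits):

#include <stdio.h>

/* little-endian size digits, type in the low 3 bits (this patch) */
static int enc_le(unsigned char *h, unsigned type, unsigned long size)
{
	int n = 1;

	h[0] = (type & 7) | ((size & 15) << 3);
	size >>= 4;
	while (size) {
		h[n - 1] |= 0x80;
		h[n++] = size & 0x7f;
		size >>= 7;
	}
	return n;
}

/* big-endian size digits, type in bits 4-6 (the removed encode_header()) */
static int enc_be(unsigned char *h, unsigned type, unsigned long size)
{
	int n = 1, i;
	unsigned char c;

	if (size) {
		n = 5;
		while (!(size & 0xfe000000)) {
			size <<= 7;
			n--;
		}
	}
	c = (type << 4) | (size >> 28);
	for (i = 1; i < n; i++) {
		h[i - 1] = c | 0x80;
		c = (size >> 21) & 0x7f;
		size <<= 7;
	}
	h[n - 1] = c;
	return n;
}

int main(void)
{
	unsigned char h[5];
	int i, n;

	n = enc_le(h, 3, 300);		/* prints "e3 12" */
	for (i = 0; i < n; i++)
		printf("%02x ", h[i]);
	printf("\n");

	n = enc_be(h, 3, 300);		/* prints "b2 2c" */
	for (i = 0; i < n; i++)
		printf("%02x ", h[i]);
	printf("\n");
	return 0;
}

With little-endian digits the low four size bits of a 300-byte blob sit
in the same place in the first byte no matter how long the header is;
with big-endian digits, which bits land where depends on the total
length.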

Signed-off-by: Nicolas Pitre <nico@cam.org>

diff --git a/pack-objects.c b/pack-objects.c
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -34,7 +34,7 @@ static void *delta_against(void *buf, un
 	if (!otherbuf)
 		die("unable to read %s", sha1_to_hex(entry->delta->sha1));
         delta_buf = diff_delta(otherbuf, othersize,
-			       buf, size, &delta_size, ~0UL);
+			       buf, size, &delta_size, 0UL);
         if (!delta_buf || delta_size != entry->delta_size)
         	die("delta size changed");
         free(buf);
@@ -42,54 +42,13 @@ static void *delta_against(void *buf, un
 	return delta_buf;
 }
 
-/*
- * The per-object header is a pretty dense thing, which is
- *  - first byte: low four bits are "size", then three bits of "type",
- *    and the high bit is "size continues".
- *  - each byte afterwards: low seven bits are size continuation,
- *    with the high bit being "size continues"
- */
-static int encode_header(enum object_type type, unsigned long size, unsigned char *hdr)
-{
-	int n = 1, i;
-	unsigned char c;
-
-	if (type < OBJ_COMMIT || type > OBJ_DELTA)
-		die("bad type %d", type);
-
-	/*
-	 * Shift the size up by 7 bits at a time,
-	 * until you get bits in the "high four".
-	 * That will be our beginning. We'll have
-	 * four size bits in 28..31, then groups
-	 * of seven in 21..27, 14..20, 7..13 and
-	 * finally 0..6.
-	 */
-	if (size) {
-		n = 5;
-		while (!(size & 0xfe000000)) {
-			size <<= 7;
-			n--;
-		}
-	}
-	c = (type << 4) | (size >> 28);
-	for (i = 1; i < n; i++) {
-		*hdr++ = c | 0x80;
-		c = (size >> 21) & 0x7f;
-		size <<= 7;
-	}
-	*hdr = c;
-	return n;
-}
-
 static unsigned long write_object(struct sha1file *f, struct object_entry *entry)
 {
 	unsigned long size;
 	char type[10];
 	void *buf = read_sha1_file(entry->sha1, type, &size);
-	unsigned char header[10];
+	char header[25];
 	unsigned hdrlen, datalen;
-	enum object_type obj_type;
 
 	if (!buf)
 		die("unable to read %s", sha1_to_hex(entry->sha1));
@@ -97,22 +56,31 @@ static unsigned long write_object(struct
 		die("object %s size inconsistency (%lu vs %lu)", sha1_to_hex(entry->sha1), size, entry->size);
 
 	/*
-	 * The object header is a byte of 'type' followed by zero or
-	 * more bytes of length.  For deltas, the 20 bytes of delta sha1
-	 * follows that.
+	 * The object header first byte has its low 3 bits representing the
+	 * object type, the 4 upper bits indicating which of the following
+	 * bytes are used to build the object size.  For delta objects the
+	 * sha1 of the reference object is also appended.
 	 */
-	obj_type = entry->type;
 	if (entry->delta) {
+		header[0] = OBJ_DELTA;
 		buf = delta_against(buf, size, entry);
 		size = entry->delta_size;
-		obj_type = OBJ_DELTA;
+	} else
+		header[0] = entry->type;
+	header[0] |= size << 3;
+	hdrlen = 1;
+	datalen = size >> 4;
+	while (datalen) {
+		header[hdrlen - 1] |= 0x80;
+		header[hdrlen++] = datalen;
+		datalen >>= 7;
 	}
-	hdrlen = encode_header(obj_type, size, header);
-	sha1write(f, header, hdrlen);
 	if (entry->delta) {
-		sha1write(f, entry->delta, 20);
+		memcpy(header+hdrlen, entry->delta, 20);
 		hdrlen += 20;
 	}
+
+	sha1write(f, header, hdrlen);
 	datalen = sha1write_compressed(f, buf, size);
 	free(buf);
 	return hdrlen + datalen;
diff --git a/pack.h b/pack.h
--- a/pack.h
+++ b/pack.h
@@ -1,13 +1,18 @@
 #ifndef PACK_H
 #define PACK_H
 
+/*
+ * The packed object type is stored in the low 3 bits of a byte.
+ * The type value 0 is reserved as a prefix in case there are ever
+ * more than 7 object types, or for any future format extensions.
+ */
 enum object_type {
-	OBJ_NONE,
-	OBJ_COMMIT,
-	OBJ_TREE,
-	OBJ_BLOB,
-	OBJ_TAG,
-	OBJ_DELTA,
+	OBJ_EXT = 0,
+	OBJ_COMMIT = 1,
+	OBJ_TREE = 2,
+	OBJ_BLOB = 3,
+	OBJ_TAG = 4,
+	OBJ_DELTA = 7
 };
 
 /*
diff --git a/unpack-objects.c b/unpack-objects.c
--- a/unpack-objects.c
+++ b/unpack-objects.c
@@ -13,6 +13,14 @@ struct pack_entry {
 	unsigned char sha1[20];
 };
 
+static char *type_string[] = {
+	[OBJ_COMMIT]	= "commit",
+	[OBJ_TREE]	= "tree",
+	[OBJ_BLOB]	= "blob",
+	[OBJ_TAG]	= "tag",
+	[OBJ_DELTA]	= "delta"
+};
+
 static void *pack_base;
 static unsigned long pack_size;
 static void *index_base;
@@ -93,7 +101,7 @@ static int check_index(void)
 }
 
 static int unpack_non_delta_entry(struct pack_entry *entry,
-				  enum object_type kind,
+				  char *type,
 				  unsigned char *data,
 				  unsigned long size,
 				  unsigned long left)
@@ -102,9 +110,8 @@ static int unpack_non_delta_entry(struct
 	z_stream stream;
 	char *buffer;
 	unsigned char sha1[20];
-	char *type;
 
-	printf("%s %c %lu\n", sha1_to_hex(entry->sha1), ".CTBGD"[kind], size);
+	printf("%s %s %lu\n", sha1_to_hex(entry->sha1), type, size);
 	if (dry_run)
 		return 0;
 
@@ -121,13 +128,6 @@ static int unpack_non_delta_entry(struct
 	inflateEnd(&stream);
 	if ((st != Z_STREAM_END) || stream.total_out != size)
 		goto err_finish;
-	switch (kind) {
-	case OBJ_COMMIT: type = "commit"; break;
-	case OBJ_TREE:   type = "tree"; break;
-	case OBJ_BLOB:   type = "blob"; break;
-	case OBJ_TAG:    type = "tag"; break;
-	default: goto err_finish;
-	}
 	if (write_sha1_file(buffer, size, type, sha1) < 0)
 		die("failed to write %s (%s)",
 		    sha1_to_hex(entry->sha1), type);
@@ -135,8 +135,8 @@ static int unpack_non_delta_entry(struct
 	if (memcmp(sha1, entry->sha1, 20))
 		die("resulting %s have wrong SHA1", type);
 
- finish:
 	st = 0;
+ finish:
 	free(buffer);
 	return st;
  err_finish:
@@ -185,15 +185,13 @@ static int unpack_delta_entry(struct pac
 		die("truncated pack file");
 	data = base_sha1 + 20;
 	data_size = left - 20;
-	printf("%s D %lu", sha1_to_hex(entry->sha1), delta_size);
+	printf("%s delta %lu", sha1_to_hex(entry->sha1), delta_size);
 	printf(" %s\n", sha1_to_hex(base_sha1));
 
 	if (dry_run)
 		return 0;
 
-	/* pack+5 is the base sha1, unless we have it, we need to
-	 * unpack it first.
-	 */
+	/* unless we have the base sha1, we need to unpack it first. */
 	if (!has_sha1_file(base_sha1)) {
 		struct pack_entry *base;
 		if (!find_pack_entry(base_sha1, &base))
@@ -238,8 +236,9 @@ static int unpack_delta_entry(struct pac
 static void unpack_entry(struct pack_entry *entry)
 {
 	unsigned long offset, size, left;
-	unsigned char *pack, c;
-	int type;
+	unsigned char c, *pack = pack_base;
+	int i;
+	enum object_type type;
 
 	/* Have we done this one already due to deltas based on it? */
 	if (lookup_object(entry->sha1))
@@ -247,20 +246,17 @@ static void unpack_entry(struct pack_ent
 
 	offset = ntohl(entry->offset);
 	if (offset >= pack_size)
-		goto bad;
-
-	pack = pack_base + offset;
-	c = *pack++;
-	offset++;
-	type = (c >> 4) & 7;
-	size = (c & 15);
+		goto out_of_bound;
+	c = pack[offset++];
+	type = c & 0x07;
+	size = (c & ~0x80) >> 3;
+	i = 4;
 	while (c & 0x80) {
 		if (offset >= pack_size)
-			goto bad;
-		offset++;
-		c = *pack++;
-		size = (size << 7) + (c & 0x7f);
-		
+			goto out_of_bound;
+		c = pack[offset++];
+		size |= (c & ~0x80) << i;
+		i += 7;
 	}
 	left = pack_size - offset;
 	switch (type) {
@@ -268,14 +264,18 @@ static void unpack_entry(struct pack_ent
 	case OBJ_TREE:
 	case OBJ_BLOB:
 	case OBJ_TAG:
-		unpack_non_delta_entry(entry, type, pack, size, left);
+		unpack_non_delta_entry(entry, type_string[type],
+				       pack+offset, size, left);
 		return;
 	case OBJ_DELTA:
-		unpack_delta_entry(entry, pack, size, left);
+		unpack_delta_entry(entry, pack+offset, size, left);
 		return;
+	default:
+		die("corrupted pack file(unknown object type %d)", type);
 	}
-bad:
-	die("corrupted pack file");
+
+ out_of_bound:
+	die("corrupted pack file (object offset out of bound)");
 }
 
 /*

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: kernel.org and GIT tree rebuilding
  2005-06-29  3:55                   ` kernel.org and GIT tree rebuilding Nicolas Pitre
@ 2005-06-29  5:16                     ` Nicolas Pitre
  2005-06-29  5:43                       ` Linus Torvalds
  0 siblings, 1 reply; 39+ messages in thread
From: Nicolas Pitre @ 2005-06-29  5:16 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List

On Tue, 28 Jun 2005, Nicolas Pitre wrote:

> OK... See below the cleanups I merged from my version on top of yours:

Of course by the time I sent the above you already rewrote the thing
to be streamable.

So again :-) see below the cleanups I merged from my version on top of 
yours:

 pack-objects.c   |   70 ++++++++++++++-----------------------------------------
 pack.h           |   17 ++++++++-----
 unpack-objects.c |   29 ++++++++++++----------
 3 files changed, 46 insertions(+), 70 deletions(-)

I also restored my original object header size ordering (little endian) 
for two reasons:

 - it is much simpler to generate, which allows quite a bit of code
   to be removed

 - it keeps the bit positions stable, which makes it much easier to
   look at a hex dump of the binary data for manual debugging

Signed-off-by: Nicolas Pitre <nico@cam.org>

diff --git a/pack-objects.c b/pack-objects.c
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -34,7 +34,7 @@ static void *delta_against(void *buf, un
 	if (!otherbuf)
 		die("unable to read %s", sha1_to_hex(entry->delta->sha1));
         delta_buf = diff_delta(otherbuf, othersize,
-			       buf, size, &delta_size, ~0UL);
+			       buf, size, &delta_size, 0UL);
         if (!delta_buf || delta_size != entry->delta_size)
         	die("delta size changed");
         free(buf);
@@ -42,54 +42,13 @@ static void *delta_against(void *buf, un
 	return delta_buf;
 }
 
-/*
- * The per-object header is a pretty dense thing, which is
- *  - first byte: low four bits are "size", then three bits of "type",
- *    and the high bit is "size continues".
- *  - each byte afterwards: low seven bits are size continuation,
- *    with the high bit being "size continues"
- */
-static int encode_header(enum object_type type, unsigned long size, unsigned char *hdr)
-{
-	int n = 1, i;
-	unsigned char c;
-
-	if (type < OBJ_COMMIT || type > OBJ_DELTA)
-		die("bad type %d", type);
-
-	/*
-	 * Shift the size up by 7 bits at a time,
-	 * until you get bits in the "high four".
-	 * That will be our beginning. We'll have
-	 * four size bits in 28..31, then groups
-	 * of seven in 21..27, 14..20, 7..13 and
-	 * finally 0..6.
-	 */
-	if (size) {
-		n = 5;
-		while (!(size & 0xfe000000)) {
-			size <<= 7;
-			n--;
-		}
-	}
-	c = (type << 4) | (size >> 28);
-	for (i = 1; i < n; i++) {
-		*hdr++ = c | 0x80;
-		c = (size >> 21) & 0x7f;
-		size <<= 7;
-	}
-	*hdr = c;
-	return n;
-}
-
 static unsigned long write_object(struct sha1file *f, struct object_entry *entry)
 {
 	unsigned long size;
 	char type[10];
 	void *buf = read_sha1_file(entry->sha1, type, &size);
-	unsigned char header[10];
+	char header[25];
 	unsigned hdrlen, datalen;
-	enum object_type obj_type;
 
 	if (!buf)
 		die("unable to read %s", sha1_to_hex(entry->sha1));
@@ -97,22 +56,31 @@ static unsigned long write_object(struct
 		die("object %s size inconsistency (%lu vs %lu)", sha1_to_hex(entry->sha1), size, entry->size);
 
 	/*
-	 * The object header is a byte of 'type' followed by zero or
-	 * more bytes of length.  For deltas, the 20 bytes of delta sha1
-	 * follows that.
+	 * The object header first byte has its low 3 bits representing the
+	 * object type, the 4 upper bits indicating which of the following
+	 * bytes are used to build the object size.  For delta objects the
+	 * sha1 of the reference object is also appended.
 	 */
-	obj_type = entry->type;
 	if (entry->delta) {
+		header[0] = OBJ_DELTA;
 		buf = delta_against(buf, size, entry);
 		size = entry->delta_size;
-		obj_type = OBJ_DELTA;
+	} else
+		header[0] = entry->type;
+	header[0] |= size << 3;
+	hdrlen = 1;
+	datalen = size >> 4;
+	while (datalen) {
+		header[hdrlen - 1] |= 0x80;
+		header[hdrlen++] = datalen;
+		datalen >>= 7;
 	}
-	hdrlen = encode_header(obj_type, size, header);
-	sha1write(f, header, hdrlen);
 	if (entry->delta) {
-		sha1write(f, entry->delta, 20);
+		memcpy(header+hdrlen, entry->delta, 20);
 		hdrlen += 20;
 	}
+
+	sha1write(f, header, hdrlen);
 	datalen = sha1write_compressed(f, buf, size);
 	free(buf);
 	return hdrlen + datalen;
diff --git a/pack.h b/pack.h
--- a/pack.h
+++ b/pack.h
@@ -1,13 +1,18 @@
 #ifndef PACK_H
 #define PACK_H
 
+/*
+ * The packed object type is stored in the low 3 bits of a byte.
+ * The type value 0 is reserved as a prefix in case there are ever
+ * more than 7 object types, or for any future format extensions.
+ */
 enum object_type {
-	OBJ_NONE,
-	OBJ_COMMIT,
-	OBJ_TREE,
-	OBJ_BLOB,
-	OBJ_TAG,
-	OBJ_DELTA,
+	OBJ_EXT = 0,
+	OBJ_COMMIT = 1,
+	OBJ_TREE = 2,
+	OBJ_BLOB = 3,
+	OBJ_TAG = 4,
+	OBJ_DELTA = 7
 };
 
 /*
diff --git a/unpack-objects.c b/unpack-objects.c
--- a/unpack-objects.c
+++ b/unpack-objects.c
@@ -6,6 +6,14 @@
 static int dry_run;
 static const char unpack_usage[] = "git-unpack-objects < pack-file";
 
+static char *type_string[] = {
+	[OBJ_COMMIT]	= "commit",
+	[OBJ_TREE]	= "tree",
+	[OBJ_BLOB]	= "blob",
+	[OBJ_TAG]	= "tag",
+	[OBJ_DELTA]	= "delta"
+};
+
 /* We always read in 4kB chunks. */
 static unsigned char buffer[4096];
 static unsigned long offset, len, eof;
@@ -139,19 +147,11 @@ static void added_object(unsigned char *
 	}
 }
 
-static int unpack_non_delta_entry(enum object_type kind, unsigned long size)
+static int unpack_non_delta_entry(char *type, unsigned long size)
 {
 	void *buf = get_data(size);
 	unsigned char sha1[20];
-	char *type;
 
-	switch (kind) {
-	case OBJ_COMMIT: type = "commit"; break;
-	case OBJ_TREE:   type = "tree"; break;
-	case OBJ_BLOB:   type = "blob"; break;
-	case OBJ_TAG:    type = "tag"; break;
-	default: die("bad type %d", kind);
-	}
 	if (write_sha1_file(buf, size, type, sha1) < 0)
 		die("failed to write object");
 	added_object(sha1, type, buf, size);
@@ -184,26 +184,29 @@ static int unpack_delta_entry(unsigned l
 static void unpack_one(void)
 {
 	unsigned char *pack, c;
+	unsigned int i;
 	unsigned long size;
 	enum object_type type;
 
 	pack = fill(1);
 	c = *pack;
 	use(1);
-	type = (c >> 4) & 7;
-	size = (c & 15);
+	type = c & 0x07;
+	size = (c & ~0x80) >> 3;
+	i = 4;
 	while (c & 0x80) {
 		pack = fill(1);
 		c = *pack++;
 		use(1);
-		size = (size << 7) + (c & 0x7f);
+		size |= (c & ~0x80) << i;
+		i += 7;
 	}
 	switch (type) {
 	case OBJ_COMMIT:
 	case OBJ_TREE:
 	case OBJ_BLOB:
 	case OBJ_TAG:
-		unpack_non_delta_entry(type, size);
+		unpack_non_delta_entry(type_string[type], size);
 		return;
 	case OBJ_DELTA:
 		unpack_delta_entry(size);

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: kernel.org and GIT tree rebuilding
  2005-06-29  5:16                     ` Nicolas Pitre
@ 2005-06-29  5:43                       ` Linus Torvalds
  2005-06-29  5:54                         ` Linus Torvalds
  0 siblings, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2005-06-29  5:43 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Git Mailing List



On Wed, 29 Jun 2005, Nicolas Pitre wrote:
> 
> Of course by the time I sent the above you already rewrote the thing to 
> be streamable.

And by the time you sent me a new version, I'd already taken part of your 
old one by hand ;)

		Linus

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: kernel.org and GIT tree rebuilding
  2005-06-29  5:43                       ` Linus Torvalds
@ 2005-06-29  5:54                         ` Linus Torvalds
  2005-06-29  7:16                           ` Last mile for 1.0 again Junio C Hamano
  0 siblings, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2005-06-29  5:54 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Git Mailing List



On Tue, 28 Jun 2005, Linus Torvalds wrote:
> 
> On Wed, 29 Jun 2005, Nicolas Pitre wrote:
> > 
> > Of course by the time I sent the above you already rewrote the thing to 
> > be streamable.
> 
> And by the time you sent me a new version, I'd already taken part of your 
> old one by hand ;)

Btw, I have the size/type bits reversed from your setup, but please don't 
change that, since that would be yet another incompatible pack format 
change, and I'd like to calm things down.

Also, I notice that you decode the sizes really strangely: you have a 
"while() { }" loop and two separate loads. It's much nicer to do it with a 
"do { } while()" loop and a single load, since not only is it less code, 
a do-while loop compiles to better code than a while() loop (unless the 
compiler is crazy, which it sometimes is).
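
For a standalone size field with no type bits in the first byte, the
do { } while() form needs just one load site; a minimal sketch in the
little-endian layout (the function name is illustrative):

	static unsigned long decode_size(const unsigned char **p)
	{
		const unsigned char *buf = *p;
		unsigned long size = 0;
		unsigned shift = 0;
		unsigned char c;

		do {
			c = *buf++;
			size |= (unsigned long)(c & 0x7f) << shift;
			shift += 7;
		} while (c & 0x80);
		*p = buf;
		return size;
	}

The object header only differs in its first byte, which gives up its
low bits to the type tag.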

		Linus

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Last mile for 1.0 again
  2005-06-29  5:54                         ` Linus Torvalds
@ 2005-06-29  7:16                           ` Junio C Hamano
  2005-06-29  9:51                             ` [PATCH] Add git-verify-pack command Junio C Hamano
  2005-07-04 21:40                             ` Last mile for 1.0 again Daniel Barkalow
  0 siblings, 2 replies; 39+ messages in thread
From: Junio C Hamano @ 2005-06-29  7:16 UTC (permalink / raw)
  To: git; +Cc: Linus Torvalds

>>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes:

LT> ..., but please don't 
LT> change that, since that would be yet another incompatible pack format 
LT> change, and I'd like to calm things down.

Calming things down is good.  Some things we should be looking at
during the calming-down period are:

 - Ask for expert help to verify that our use of zlib is OK; Linus
   asked for this in a message tonight.  I am not qualified to do
   this myself.

 - There is currently nothing that verifies the SHA1 sums
   embedded in idx and pack files.  A tool for it is needed
   [*1*].  I may be doing this myself.

 - Although there is a hook in sha1_file() that allows mapping a
   pack file on demand (and keep track of their uses for LRU
   eviction), there is currently no code to actually evict a
   mapped pack file.  We need to add some sort of use-reference
   counting to lock down a pack file in use, if somebody is
   inclined to do this.  I will not be doing this myself
   immediately.

 - Think about what to do with the "dumb server" pull protocols
   regarding packed archive files (think about http-pull).  I
   will not be doing this myself.

 - Design "smart server" pull/push pair.  This can independently
   and actively be done without impacting "calming down" period,
   and I am assuming Dan is already looking into this.  This may
   result in another "deathmatch" between Dan and Jason
   Mcmullan, which would be fun to watch ;-).


Not limited to "calming down the packed GIT", but bigger picture
pre-release preparation items I think we need to look into are:

 - Update "Last mile for 1.0" list.  I think the only thing that
   is left from the one I posted on Jun 5th is the tutorials.

   Anybody interested in adding more pages to the tutorials?  If
   there is no taker, I may start continuing what is already
   there, stealing example project outline from "Arch meets
   Hello World" [*2*].

 - Double check the documentation, usage strings in the code and
   what the code does still match.  Last time I checked, I think
   some files in Documentation/ directory were not reachable
   from the main GIT(1) page and were not touched by the build
   process from Documentation/Makefile.  I will not be doing
   this myself.

 - Blame/Annotate.  Does anybody have a fast and correct one
   [*3*]?

 - Publicity.  Somebody may want to start preparing materials to
   have us included in http://better-scm.berlios.de/comparison/


[Footnotes]

*1* One possibility is to add that to fsck-cache when --full is
used, but I am somewhat reluctant to do it that way.  It would
require you to place that "suspicious" pack under your
repository's .git/objects/pack in order to verify it, which can
spread the damage before you know it.

In general, the idx file is relatively small, so we _could_ check
the SHA1 sum for idx files when we map them at runtime in
check_packed_git_idx().  However, verifying 1MB (kernel archive
case) idx file every time we run a single core GIT command may
be a price we would not want to pay, so I am not too
enthusiastic about doing it.  We certainly would not want to do
this for pack files every time a pack is mapped at runtime (60MB
kernel archive case).

*2* http://regexps.srparish.net/www/tutorial/html/arch.html.

I am not talking about stealing the concepts but just talking
about stealing the sample project on which example players work
in the sample sessions.  I think its section on update/replay
corresponds closely to merge/rebase, for example.

*3* I have what I wrote in Perl which does rename/copy/rewrite
and multiple parents correctly, but it is way too slow for
on-demand use.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH] Add git-verify-pack command.
  2005-06-29  7:16                           ` Last mile for 1.0 again Junio C Hamano
@ 2005-06-29  9:51                             ` Junio C Hamano
  2005-06-29 16:15                               ` Linus Torvalds
  2005-07-04 21:40                             ` Last mile for 1.0 again Daniel Barkalow
  1 sibling, 1 reply; 39+ messages in thread
From: Junio C Hamano @ 2005-06-29  9:51 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

Given a list of <pack>.idx files, this command validates the
index file and the corresponding .pack file for consistency.

This patch also uses the same validation mechanism in fsck-cache
when the --full flag is used.

During normal operation, sha1_file.c verifies that a given .idx
file matches the .pack file by comparing the SHA1 checksums
stored in the .idx and .pack files as a minimum sanity check.
We may further want to check the pack signature and version when
we map the pack, but that would be a separate patch.
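
Such a check could look something like the sketch below (written
against the pack_header layout verify_packfile uses; it is not part
of this patch):

	static void check_pack_header(struct packed_git *p)
	{
		struct pack_header *hdr = p->pack_base;

		if (hdr->hdr_signature != htonl(PACK_SIGNATURE))
			die("packfile %s has a bad signature", p->pack_name);
		if (hdr->hdr_version != htonl(PACK_VERSION))
			die("packfile %s is version %d, not %d",
			    p->pack_name, ntohl(hdr->hdr_version), PACK_VERSION);
	}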

Earlier, errors mapping a pack file were not flagged as fatal but
led to a random fatal error later.  This version explicitly die()s
when such an error is detected.

Signed-off-by: Junio C Hamano <junkio@cox.net>
---

*** This actually covers two of the items I listed in the "Last
*** mile again" message.  I ended up doing the LRU myself.

*** Linus, what would be the practical/recommended/sane limit
*** for mmap regions we would want to use?  Currently I start
*** throwing away mapped packs when the total size of mapped
*** packs exceeds 64MB.

*** Currently, the pack-objects/unpack-objects pair is not
*** documented.  Any takers?

*** Oh, again, Linus, what is your preferred way to get a
*** cover-letter material (like this) for a single patch?  I
*** always do a separate cover letter for multi-patch series
*** with [PATCH 0/N], but do you prefer to get [PATCH 0/1] and
*** [PATCH 1/1] for a single patch, or do you prefer to see
*** commentaries after the three-dash line like this message
*** does?

 Documentation/git-verify-pack.txt |   38 +++++++++++++++++++++
 Documentation/git.txt             |    3 ++
 Makefile                          |    5 ++-
 cache.h                           |    4 ++
 fsck-cache.c                      |    5 +++
 pack.h                            |    2 +
 sha1_file.c                       |   66 ++++++++++++++++++++++++++++---------
 t/t5300-pack-object.sh            |   38 +++++++++++++++++++++
 verify-pack.c                     |   26 +++++++++++++++
 verify_pack.c                     |   62 +++++++++++++++++++++++++++++++++++
 10 files changed, 231 insertions(+), 18 deletions(-)
 create mode 100644 Documentation/git-verify-pack.txt
 create mode 100644 verify-pack.c
 create mode 100644 verify_pack.c

8ac4f374a587c58b3d2ecc44220b6b866f769350
diff --git a/Documentation/git-verify-pack.txt b/Documentation/git-verify-pack.txt
new file mode 100644
--- /dev/null
+++ b/Documentation/git-verify-pack.txt
@@ -0,0 +1,38 @@
+git-verify-pack(1)
+==================
+v0.1, June 2005
+
+NAME
+----
+git-verify-pack - Validate packed GIT archive files.
+
+
+SYNOPSIS
+--------
+'git-verify-pack' <pack>.idx ...
+
+
+DESCRIPTION
+-----------
+Reads the given idx files for packed GIT archives created with
+the git-pack-objects command and verifies each idx file and its
+corresponding pack file.
+
+OPTIONS
+-------
+<pack>.idx ...::
+	The idx files to verify.
+
+
+Author
+------
+Written by Junio C Hamano <junkio@cox.net>
+
+Documentation
+--------------
+Documentation by Junio C Hamano
+
+GIT
+---
+Part of the link:git.html[git] suite
+
diff --git a/Documentation/git.txt b/Documentation/git.txt
--- a/Documentation/git.txt
+++ b/Documentation/git.txt
@@ -110,6 +110,9 @@ link:git-tar-tree.html[git-tar-tree]::
 link:git-unpack-file.html[git-unpack-file]::
 	Creates a temporary file with a blob's contents
 
+link:git-verify-pack.html[git-verify-pack]::
+	Validates packed GIT archive files
+
 The interrogate commands may create files - and you can force them to
 touch the working file set - but in general they don't
 
diff --git a/Makefile b/Makefile
--- a/Makefile
+++ b/Makefile
@@ -36,7 +36,7 @@ PROG=   git-update-cache git-diff-files 
 	git-diff-helper git-tar-tree git-local-pull git-write-blob \
 	git-get-tar-commit-id git-apply git-stripspace \
 	git-cvs2git git-diff-stages git-rev-parse git-patch-id \
-	git-pack-objects git-unpack-objects
+	git-pack-objects git-unpack-objects git-verify-pack
 
 all: $(PROG)
 
@@ -45,7 +45,7 @@ install: $(PROG) $(SCRIPTS)
 
 LIB_OBJS=read-cache.o sha1_file.o usage.o object.o commit.o tree.o blob.o \
 	 tag.o date.o index.o diff-delta.o patch-delta.o entry.o \
-	 epoch.o refs.o csum-file.o
+	 epoch.o refs.o csum-file.o verify_pack.o
 LIB_FILE=libgit.a
 LIB_H=cache.h object.h blob.h tree.h commit.h tag.h delta.h epoch.h csum-file.h pack.h
 
@@ -124,6 +124,7 @@ git-rev-parse: rev-parse.c
 git-patch-id: patch-id.c
 git-pack-objects: pack-objects.c
 git-unpack-objects: unpack-objects.c
+git-verify-pack: verify-pack.c
 
 git-http-pull: LIBS += -lcurl
 git-rev-list: LIBS += -lssl
diff --git a/cache.h b/cache.h
--- a/cache.h
+++ b/cache.h
@@ -246,9 +246,13 @@ extern struct packed_git {
 	unsigned int *index_base;
 	void *pack_base;
 	unsigned int pack_last_used;
+	unsigned int pack_use_cnt;
 	char pack_name[0]; /* something like ".git/objects/pack/xxxxx.pack" */
 } *packed_git;
 extern void prepare_packed_git(void);
+extern int use_packed_git(struct packed_git *);
+extern void unuse_packed_git(struct packed_git *);
+extern struct packed_git *add_packed_git(char *, int);
 extern int num_packed_objects(const struct packed_git *p);
 extern int nth_packed_object_sha1(const struct packed_git *, int, unsigned char*);
 
diff --git a/fsck-cache.c b/fsck-cache.c
--- a/fsck-cache.c
+++ b/fsck-cache.c
@@ -6,6 +6,7 @@
 #include "tree.h"
 #include "blob.h"
 #include "tag.h"
+#include "pack.h"
 
 #define REACHABLE 0x0001
 
@@ -437,6 +438,10 @@ int main(int argc, char **argv)
 			alt_odb[j].name[-1] = '/';
 		}
 		prepare_packed_git();
+		for (p = packed_git; p; p = p->next)
+			/* verify gives error messages itself */
+			verify_pack(p); 
+
 		for (p = packed_git; p; p = p->next) {
 			int num = num_packed_objects(p);
 			for (i = 0; i < num; i++) {
diff --git a/pack.h b/pack.h
--- a/pack.h
+++ b/pack.h
@@ -27,4 +27,6 @@ struct pack_header {
 	unsigned int hdr_entries;
 };
 
+extern int verify_pack(struct packed_git *);
+
 #endif
diff --git a/sha1_file.c b/sha1_file.c
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -302,7 +302,7 @@ static int check_packed_git_idx(const ch
 	index = idx_map;
 
 	/* check index map */
-	if (idx_size < 4*256 + 20)
+	if (idx_size < 4*256 + 20 + 20)
 		return error("index file too small");
 	nr = 0;
 	for (i = 0; i < 256; i++) {
@@ -327,12 +327,29 @@ static int check_packed_git_idx(const ch
 	return 0;
 }
 
-static void unuse_one_packed_git(void)
+static int unuse_one_packed_git(void)
 {
-	/* NOTYET */
+	struct packed_git *p, *lru = NULL;
+
+	for (p = packed_git; p; p = p->next) {
+		if (p->pack_use_cnt || !p->pack_base)
+			continue;
+		if (!lru || p->pack_last_used < lru->pack_last_used)
+			lru = p;
+	}
+	if (!lru)
+		return 0;
+	munmap(lru->pack_base, lru->pack_size);
+	lru->pack_base = NULL;
+	return 1;
+}
+
+void unuse_packed_git(struct packed_git *p)
+{
+	p->pack_use_cnt--;
 }
 
-static int use_packed_git(struct packed_git *p)
+int use_packed_git(struct packed_git *p)
 {
 	if (!p->pack_base) {
 		int fd;
@@ -340,28 +357,36 @@ static int use_packed_git(struct packed_
 		void *map;
 
 		pack_mapped += p->pack_size;
-		while (PACK_MAX_SZ < pack_mapped)
-			unuse_one_packed_git();
+		while (PACK_MAX_SZ < pack_mapped && unuse_one_packed_git())
+			; /* nothing */
 		fd = open(p->pack_name, O_RDONLY);
 		if (fd < 0)
-			return -1;
+			die("packfile %s cannot be opened", p->pack_name);
 		if (fstat(fd, &st)) {
 			close(fd);
-			return -1;
+			die("packfile %s cannot be opened", p->pack_name);
 		}
 		if (st.st_size != p->pack_size)
-			return -1;
+			die("packfile %s size mismatch.", p->pack_name);
 		map = mmap(NULL, p->pack_size, PROT_READ, MAP_PRIVATE, fd, 0);
 		close(fd);
 		if (map == MAP_FAILED)
-			return -1;
+			die("packfile %s cannot be mapped.", p->pack_name);
 		p->pack_base = map;
+
+		/* Check if the pack file matches the index file.
+		 * This is cheap.
+		 */
+		if (memcmp((char*)(p->index_base) + p->index_size - 40,
+			   p->pack_base + p->pack_size - 20, 20))
+			die("packfile %s does not match index.", p->pack_name);
 	}
 	p->pack_last_used = pack_used_ctr++;
+	p->pack_use_cnt++;
 	return 0;
 }
 
-static struct packed_git *add_packed_git(char *path, int path_len)
+struct packed_git *add_packed_git(char *path, int path_len)
 {
 	struct stat st;
 	struct packed_git *p;
@@ -388,6 +413,7 @@ static struct packed_git *add_packed_git
 	p->next = NULL;
 	p->pack_base = NULL;
 	p->pack_last_used = 0;
+	p->pack_use_cnt = 0;
 	return p;
 }
 
@@ -671,6 +697,7 @@ static int packed_object_info(struct pac
 	unsigned long offset, size, left;
 	unsigned char *pack;
 	enum object_type kind;
+	int retval;
 
 	if (use_packed_git(p))
 		die("cannot map packed file");
@@ -681,8 +708,9 @@ static int packed_object_info(struct pac
 
 	switch (kind) {
 	case OBJ_DELTA:
-		return packed_delta_info(pack, size, left, type, sizep);
-		break;
+		retval = packed_delta_info(pack, size, left, type, sizep);
+		unuse_packed_git(p);
+		return retval;
 	case OBJ_COMMIT:
 		strcpy(type, "commit");
 		break;
@@ -699,6 +727,7 @@ static int packed_object_info(struct pac
 		die("corrupted pack file");
 	}
 	*sizep = size;
+	unuse_packed_git(p);
 	return 0;
 }
 
@@ -785,6 +814,7 @@ static void *unpack_entry(struct pack_en
 	unsigned long offset, size, left;
 	unsigned char *pack;
 	enum object_type kind;
+	void *retval;
 
 	if (use_packed_git(p))
 		die("cannot map packed file");
@@ -794,7 +824,9 @@ static void *unpack_entry(struct pack_en
 	left = p->pack_size - offset;
 	switch (kind) {
 	case OBJ_DELTA:
-		return unpack_delta_entry(pack, size, left, type, sizep);
+		retval = unpack_delta_entry(pack, size, left, type, sizep);
+		unuse_packed_git(p);
+		return retval;
 	case OBJ_COMMIT:
 		strcpy(type, "commit");
 		break;
@@ -811,12 +843,14 @@ static void *unpack_entry(struct pack_en
 		die("corrupted pack file");
 	}
 	*sizep = size;
-	return unpack_non_delta_entry(pack, size, left);
+	retval = unpack_non_delta_entry(pack, size, left);
+	unuse_packed_git(p);
+	return retval;
 }
 
 int num_packed_objects(const struct packed_git *p)
 {
-	/* See check_packed_git_idx and pack-objects.c */
+	/* See check_packed_git_idx() */
 	return (p->index_size - 20 - 20 - 4*256) / 24;
 }
 
diff --git a/t/t5300-pack-object.sh b/t/t5300-pack-object.sh
--- a/t/t5300-pack-object.sh
+++ b/t/t5300-pack-object.sh
@@ -127,4 +127,42 @@ test_expect_success \
     } >current &&
     diff expect current'
 
+unset GIT_OBJECT_DIRECTORY
+
+test_expect_success \
+    'verify pack' \
+    'git-verify-pack test-1.idx test-2.idx'
+
+test_expect_success \
+    'corrupt a pack and see if verify catches' \
+    'cp test-1.idx test-3.idx &&
+     cp test-2.pack test-3.pack &&
+     if git-verify-pack test-3.idx
+     then false
+     else :;
+     fi &&
+
+     cp test-1.pack test-3.pack &&
+     dd if=/dev/zero of=test-3.pack count=1 bs=1 conv=notrunc seek=2 &&
+     if git-verify-pack test-3.idx
+     then false
+     else :;
+     fi &&
+
+     cp test-1.pack test-3.pack &&
+     dd if=/dev/zero of=test-3.pack count=1 bs=1 conv=notrunc seek=7 &&
+     if git-verify-pack test-3.idx
+     then false
+     else :;
+     fi &&
+
+     cp test-1.pack test-3.pack &&
+     dd if=/dev/zero of=test-3.pack count=1 bs=1 conv=notrunc seek=12 &&
+     if git-verify-pack test-3.idx
+     then false
+     else :;
+     fi &&
+
+     :'
+
 test_done
diff --git a/verify-pack.c b/verify-pack.c
new file mode 100644
--- /dev/null
+++ b/verify-pack.c
@@ -0,0 +1,26 @@
+#include "cache.h"
+#include "pack.h"
+
+static int verify_one_pack(char *arg)
+{
+	struct packed_git *g = add_packed_git(arg, strlen(arg));
+	if (!g)
+		return -1;
+	return verify_pack(g);
+}
+
+int main(int ac, char **av)
+{
+	int errs = 0;
+
+	while (1 < ac) {
+		char path[PATH_MAX];
+		strcpy(path, av[1]);
+		if (verify_one_pack(path))
+			errs++;
+		else
+			printf("%s: OK\n", av[1]);
+		ac--; av++;
+	}
+	return !!errs;
+}
diff --git a/verify_pack.c b/verify_pack.c
new file mode 100644
--- /dev/null
+++ b/verify_pack.c
@@ -0,0 +1,62 @@
+#include "cache.h"
+#include "pack.h"
+
+static int verify_packfile(struct packed_git *p)
+{
+	unsigned long index_size = p->index_size;
+	void *index_base = p->index_base;
+	SHA_CTX ctx;
+	unsigned char sha1[20];
+	unsigned long pack_size = p->pack_size;
+	void *pack_base;
+	struct pack_header *hdr;
+	int nr_objects;
+
+	hdr = p->pack_base;
+	if (hdr->hdr_signature != htonl(PACK_SIGNATURE))
+		return error("Packfile signature mismatch", p->pack_name);
+	if (hdr->hdr_version != htonl(PACK_VERSION))
+		return error("Packfile version %d different from ours %d",
+			     ntohl(hdr->hdr_version), PACK_VERSION);
+	nr_objects = ntohl(hdr->hdr_entries);
+	if (num_packed_objects(p) != nr_objects)
+		return error("Packfile claims to have %d objects, "
+			     "while idx size expects %d", nr_objects,
+			     num_packed_objects(p));
+
+	SHA1_Init(&ctx);
+	pack_base = p->pack_base;
+	SHA1_Update(&ctx, pack_base, pack_size - 20);
+	SHA1_Final(sha1, &ctx);
+	if (memcmp(sha1, index_base + index_size - 40, 20))
+		return error("Packfile %s SHA1 mismatch with idx",
+			     p->pack_name);
+	if (memcmp(sha1, pack_base + pack_size - 20, 20))
+		return error("Packfile %s SHA1 mismatch with itself",
+			     p->pack_name);
+	return 0;
+}
+
+
+int verify_pack(struct packed_git *p)
+{
+	unsigned long index_size = p->index_size;
+	void *index_base = p->index_base;
+	SHA_CTX ctx;
+	unsigned char sha1[20];
+	int ret;
+
+	/* Verify SHA1 sum of the index file */
+	SHA1_Init(&ctx);
+	SHA1_Update(&ctx, index_base, index_size - 20);
+	SHA1_Final(sha1, &ctx);
+	if (memcmp(sha1, index_base + index_size - 20, 20))
+		return error("Packfile index for %s SHA1 mismatch",
+			     p->pack_name);
+
+	/* Verify pack file */
+	use_packed_git(p);
+	ret = verify_packfile(p);
+	unuse_packed_git(p);
+	return ret;
+}
------------

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] Add git-verify-pack command.
  2005-06-29  9:51                             ` [PATCH] Add git-verify-pack command Junio C Hamano
@ 2005-06-29 16:15                               ` Linus Torvalds
  0 siblings, 0 replies; 39+ messages in thread
From: Linus Torvalds @ 2005-06-29 16:15 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git



On Wed, 29 Jun 2005, Junio C Hamano wrote:
> 
> *** Linus, what would be the practical/recommended/sane limit
> *** for mmap regions we would want to use?  Currently I start
> *** throwing away mapped packs when the total size of mapped
> *** packs exceeds 64MB.

Hey, anybody who ever used BK had better have had 1GB of memory for any 
real development. So 64MB is peanuts, but it sounds like a good guess.

> *** Oh, again, Linus, what is your preferred way to get a
> *** cover-letter material (like this) for a single patch?

I don't want cover-letters for single patches, or necessarily even for
short series (1-3 entries). The cover letter is more interesting for large
series, or even with small series if the early patches don't make much
sense on their own: then the cover-letter ends up being a useful place to
explain what the _sequence_ does, and why patch #2 that seems to be
totally useless and removes a feature is actually good ("we'll
re-implement it better in #5 after we've cleaned the code up").

So generally commentaries after the three dashes is good, if the 
commentary is "local", ie related not to a series. Only with non-local 
explanations does a separate [0/N] thing make sense to me.

		Linus

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: kernel.org and GIT tree rebuilding
@ 2005-07-03  2:51 linux
  0 siblings, 0 replies; 39+ messages in thread
From: linux @ 2005-07-03  2:51 UTC (permalink / raw)
  To: git, torvalds

> 	unsigned long size;
> 	unsigned char c;
> 
> 	c = *pack++;
> 	size = c & 15;
> 	type = (c >> 4) & 7;
> 	while (c & 0x80) {
> 		c = *pack++;
> 		size = (size << 7) + (c & 0x7f);
> 	}
> 
> or something. That's even denser.

If you're going for density, you missed something.  Try:

 	c = *pack++;
 	size = c & 15;
 	type = (c >> 4) & 7;
 	while (c & 0x80) {
 		c = *pack++;
 		size = (size << 7) + (c & 0x7f) + 16;
 	}

Encoding is most easily done by generating the digits in little-endian
order and then emitting them in reverse, such as:

static unsigned
encode(unsigned char *p, unsigned long x)
{
        unsigned char *q = p;
        unsigned char buf[5];
        unsigned char *b = buf;

        while (x > 15) {
                assert(b < buf+5);
                x -= 16;
                *b++ = x & 0x7f;
                x >>= 7;
        }
        *b = x;

        while (b != buf)
                *q++ = *b-- | 0x80;
        *q++ = *b;
        return (unsigned)(q - p);
}

(You'll probably want to rewrite the above, but it's abandoned to the
public domain in any case.  Go nuts.)
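
A quick way to convince yourself the "+ 16" is right is to round-trip
encode() against a decoder (decode() below just wraps the loop above,
with the type bits omitted since encode() does not produce them; the
test harness is mine, and the #include must come first because
encode() also uses assert):

#include <assert.h>

static unsigned long decode(const unsigned char *p)
{
	unsigned char c = *p++;
	unsigned long size = c & 15;

	while (c & 0x80) {
		c = *p++;
		size = (size << 7) + (c & 0x7f) + 16;
	}
	return size;
}

int main(void)
{
	unsigned char buf[5];
	unsigned long x;

	for (x = 0; x < (1UL << 20); x++) {
		encode(buf, x);
		assert(decode(buf) == x);
	}
	/* without the + 16, 0x80 0x00 would be a second encoding of 0;
	 * with it, it decodes to 16, the smallest two-byte value */
	assert(decode((const unsigned char *)"\x80\x00") == 16);
	return 0;
}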

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Last mile for 1.0 again
  2005-06-29  7:16                           ` Last mile for 1.0 again Junio C Hamano
  2005-06-29  9:51                             ` [PATCH] Add git-verify-pack command Junio C Hamano
@ 2005-07-04 21:40                             ` Daniel Barkalow
  2005-07-04 21:45                               ` Junio C Hamano
  2005-07-04 21:59                               ` Linus Torvalds
  1 sibling, 2 replies; 39+ messages in thread
From: Daniel Barkalow @ 2005-07-04 21:40 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Linus Torvalds

On Wed, 29 Jun 2005, Junio C Hamano wrote:

>  - Blame/Annotate.  Does anybody have a fast and correct one

How about an option to git-rev-list to take a path, and (1) exclude any
branch where the version at that path ends up ignored in a merge and
(2) not list any revision where the version at that path is identical to a
parent?

This should give you the list of all commits which are directly
responsible for the present state of the file, which can then be formatted
as desired.

	-Daniel
*This .sig left intentionally blank*

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Last mile for 1.0 again
  2005-07-04 21:40                             ` Last mile for 1.0 again Daniel Barkalow
@ 2005-07-04 21:45                               ` Junio C Hamano
  2005-07-04 21:59                               ` Linus Torvalds
  1 sibling, 0 replies; 39+ messages in thread
From: Junio C Hamano @ 2005-07-04 21:45 UTC (permalink / raw)
  To: Daniel Barkalow; +Cc: git, Linus Torvalds

>>>>> "DB" == Daniel Barkalow <barkalow@iabervon.org> writes:

DB> On Wed, 29 Jun 2005, Junio C Hamano wrote:
>> - Blame/Annotate.  Does anybody have a fast and correct one

DB> How about an option to git-rev-list to take a path, and (1) exclude any
DB> branch where the version at that path ends up ignored in a merge and
DB> (2) not list any revision where the version at that path is identical to a
DB> parent?

DB> This should give you the list of all commits which are directly
DB> responsible for the present state of the file, which can then be formatted
DB> as desired.

Sounds close enough if you do not care about copies and
complete rewrites.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Last mile for 1.0 again
  2005-07-04 21:40                             ` Last mile for 1.0 again Daniel Barkalow
  2005-07-04 21:45                               ` Junio C Hamano
@ 2005-07-04 21:59                               ` Linus Torvalds
  2005-07-04 22:41                                 ` Daniel Barkalow
  1 sibling, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2005-07-04 21:59 UTC (permalink / raw)
  To: Daniel Barkalow; +Cc: Junio C Hamano, git



On Mon, 4 Jul 2005, Daniel Barkalow wrote:
> 
> How about an option to git-rev-list to take a path, and (1) exclude any
> branch where the version at that path ends up ignored in a merge and
> (2) not list any revision where the version at that path is identical to a
> parent?

Hmm. How is that different from "git-whatchanged path", really?

			Linus

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Last mile for 1.0 again
  2005-07-04 21:59                               ` Linus Torvalds
@ 2005-07-04 22:41                                 ` Daniel Barkalow
  2005-07-04 23:06                                   ` Junio C Hamano
  0 siblings, 1 reply; 39+ messages in thread
From: Daniel Barkalow @ 2005-07-04 22:41 UTC (permalink / raw)
  To: Linus Torvalds, Junio C Hamano; +Cc: git

On Mon, 4 Jul 2005, Linus Torvalds wrote:

> On Mon, 4 Jul 2005, Daniel Barkalow wrote:
> > 
> > How about an option to git-rev-list to take a path, and (1) exclude any
> > branch where the version at that path ends up ignored in a merge and
> > (2) not list any revision where the version at that path is identical to a
> > parent?
> 
> Hmm. How is that different from "git-whatchanged path", really?

It would short-circuit going up areas of the history which don't
contribute (i.e., lead up to a merge which took its version from a
different parent). It could also stop when it ran out of branches that
have the file at all. Neither of these is all that significant, I guess.

Junio: what's missing from annotate/blame?

	-Daniel
*This .sig left intentionally blank*

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Last mile for 1.0 again
  2005-07-04 22:41                                 ` Daniel Barkalow
@ 2005-07-04 23:06                                   ` Junio C Hamano
  2005-07-05  1:54                                     ` Daniel Barkalow
  0 siblings, 1 reply; 39+ messages in thread
From: Junio C Hamano @ 2005-07-04 23:06 UTC (permalink / raw)
  To: Daniel Barkalow; +Cc: Linus Torvalds, git

>>>>> "DB" == Daniel Barkalow <barkalow@iabervon.org> writes:

DB> Junio: what's missing from annotate/blame?

Which one are you talking about?

What I use to generate http://members.cox.net/junkio/Summary.txt
is an implementation of an algorithm I consider "complete" in
that it does rename/copy and complete rewrite correctly.  What
is missing from the implementation is efficiency.

------------
#!/usr/bin/perl -w

use strict;

package main;
$::debug = 0;

sub read_blob {
    my $sha1 = shift;
    my $fh = undef;
    my $result;
    local ($/) = undef;
    open $fh, '-|', 'git-cat-file', 'blob', $sha1
	or die "cannot read blob $sha1";
    $result = join('', <$fh>);
    close $fh
	or die "failure while closing pipe to git-cat-file";
    return $result;
}

sub read_diff_raw {
    my ($parent, $filename) = @_;
    my $fh = undef;
    local ($/) = "\0";
    my @result = (); 
    my ($meta, $status, $sha1_1, $sha1_2, $file1, $file2);

    print STDERR "* diff-cache --cached $parent $filename\n" if $::debug;
    my $has_changes = 0;
    open $fh, '-|', 'git-diff-cache', '--cached', '-z', $parent, $filename
	or die "cannot read git-diff-cache $parent $filename";
    while (defined ($meta = <$fh>)) {
	$has_changes = 1;
    }
    close $fh
	or die "failure while closing pipe to git-diff-cache";
    if (!$has_changes) {
	return ();
    }

    $fh = undef;
    print STDERR "* diff-cache -B -C --find-copies-harder --cached $parent\n" if $::debug;
    open($fh, '-|', 'git-diff-cache', '-B', '-C', '--find-copies-harder',
	 '--cached', '-z', $parent)
	or die "cannot read git-diff-cache with $parent";
    while (defined ($meta = <$fh>)) {
	chomp($meta);
	(undef, undef, $sha1_1, $sha1_2, $status) = split(/ /, $meta);
	$file1 = <$fh>;
	chomp($file1);
	if ($status =~ /^[CR]/) {
	    $file2 = <$fh>;
	    chomp($file2);
	} elsif ($status =~ /^D/) {
	    next;
	} else {
	    $file2 = $file1;
	}
	if ($file2 eq $filename) {
	    push @result, [$status, $sha1_1, $sha1_2, $file1, $file2];
	}
    }
    close $fh
	or die "failure while closing pipe to git-diff-cache";
    return @result;
}

sub write_temp_blob {
    my ($sha1, $temp) = @_;
    my $fh = undef;
    my $blob = read_blob($sha1);
    open $fh, '>', $temp
	or die "cannot open temporary file $temp";
    print $fh $blob;
    close($fh);
}

package Git::Patch;
sub new {
    my ($class, $sha1_1, $sha1_2) = @_;
    my $self = bless [], $class;
    my $fh = undef;
    ::write_temp_blob($sha1_1, "/tmp/blame-$$-1");
    ::write_temp_blob($sha1_2, "/tmp/blame-$$-2");
    open $fh, '-|', 'diff', '-u0', "/tmp/blame-$$-1", "/tmp/blame-$$-2"
	or die "cannot read diff";
    while (<$fh>) {
	if (/^\@\@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? \@\@/) {
	    push @$self, [$1, (defined $2 ? $2 : 1),
			  $3, (defined $4 ? $4 : 1)];
	}
    }
    close $fh;
    unlink "/tmp/blame-$$-1", "/tmp/blame-$$-2";
    return $self;
}

sub find_parent_line {
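    # Map a line number in this commit's version of the file to the
    # corresponding line number in the parent's version by walking the
    # hunk offsets; -1 means this commit changed the line itself.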
    my ($self, $commit_lineno) = @_;
    my $ofs = 0;
    for (@$self) {
	my ($line_1, $len_1, $line_2, $len_2) = @$_;
	if ($commit_lineno < $line_2) {
	    return $commit_lineno - $ofs;
	}
	if ($line_2 <= $commit_lineno && $commit_lineno < $line_2 + $len_2) {
	    return -1; # changed by commit.
	}
	$ofs += ($len_1 - $len_2);
    }
    return $commit_lineno + $ofs;
}

package Git::Commit;

my %author_name_canon = 
('Linus Torvalds <torvalds@ppc970.osdl.org.(none)>' =>
 'Linus Torvalds <torvalds@ppc970.osdl.org>');

sub canon_author_name {
    my ($name) = @_;
    if (exists $author_name_canon{$name}) {
	return $author_name_canon{$name};
    }
    return $name;
}

sub new {
    my $class = shift;
    my $self = bless {
	PARENT => [],
	TREE => undef,
	AUTHOR => undef,
	COMMITTER => undef,
    }, $class;
    my $commit_sha1 = shift;
    $self->{SHA1} = $commit_sha1;
    my $fh = undef;
    open $fh, '-|', 'git-cat-file', 'commit', $commit_sha1
	or die "cannot read commit object $commit_sha1";
    while (<$fh>) {
	chomp;
	if (/^tree ([0-9a-f]{40})$/) { $self->{TREE} = $1; }
	elsif (/^parent ([0-9a-f]{40})$/) { push @{$self->{PARENT}}, $1; }
	elsif (/^author ([^>]+>)/) {
	    $self->{AUTHOR} = canon_author_name($1);
	}
	elsif (/^committer ([^>]+>)/) {
	    $self->{COMMITTER} = canon_author_name($1);
	}
    }
    close $fh
	or die "failure while closing pipe to git-cat-file";
    return $self;
}

sub find_file {
    my ($commit, $path) = @_;
    my $result = undef;
    my $fh = undef;
    local ($/) = "\0";
    open $fh, '-|', 'git-ls-tree', '-z', '-r', '-d', $commit->{TREE}, $path
	or die "cannot read git-ls-tree $commit->{TREE}";
    while (<$fh>) {
	chomp;
	if (/^[0-7]{6} blob ([0-9a-f]{40})	(.*)$/) {
	    if ($2 ne $path) {
		die "$2 ne $path???";
	    }
	    $result = $1;
	    last;
	}
    }
    close $fh
	or die "failure while closing pipe to git-ls-tree";
    return $result;
}

package Git::Blame;
sub new {
    my $class = shift;
    my $self = bless {
	LINE => [],
	UNKNOWN => undef,
	WORK => [],
    }, $class;
    my $commit = shift;
    my $filename = shift;
    my $sha1 = $commit->find_file($filename);
    my $blob = ::read_blob($sha1);
    my @blob = (split(/\n/, $blob));
    for (my $i = 0; $i < @blob; $i++) {
	$self->{LINE}[$i] = +{
	    COMMIT => $commit,
	    FOUND => undef,
	    FILENAME => $filename,
	    LINENO => ($i + 1),
	};
    }
    $self->{UNKNOWN} = scalar @blob;
    push @{$self->{WORK}}, [$commit, $filename];
    return $self;
}

sub read_blame_cache {
    my $self = shift;
    my $filename = shift;
    my $fh = undef;
    my $pi = $self->{'PATHINFO'} = {};
    open $fh, '<', $filename;
    while (<$fh>) {
	chomp;
	my ($commit, $parent, $path) = split(/\t/, $_);
	$pi->{$path}{$commit}{$parent} = 1;
    }
    close $fh;
}

sub print {
    my $self = shift;
    my $line_termination = shift;
    for (my $i = 0; $i < @{$self->{LINE}}; $i++) {
	my $l = $self->{LINE}[$i];
	print($l->{FOUND} ? ':' : '?');
	print "$l->{COMMIT}->{SHA1}	";
	print "$l->{COMMIT}->{AUTHOR}	";
	print "$l->{COMMIT}->{COMMITTER}	";
	print "$l->{LINENO}	$l->{FILENAME}";
	print $line_termination;
    }
}

sub take_responsibility {
    my ($self, $commit) = @_;
    for (my $i = 0; $i < @{$self->{LINE}}; $i++) {
	my $l = $self->{LINE}[$i];
	if (! $l->{FOUND} && ($l->{COMMIT}->{SHA1} eq $commit->{SHA1})) {
	    $l->{FOUND} = 1;
	    $self->{UNKNOWN}--;
	}
    }
}

sub blame_parent {
    my ($self, $commit, $parent, $filename) = @_;
    my @diff = ::read_diff_raw($parent->{SHA1}, $filename);
    my $filename_in_parent;
    my $passed_blame_to_parent = undef;
    if (@diff == 0) {
	# We have not touched anything.  Blame parent for everything
	# that we are suspected for.
	for (my $i = 0; $i < @{$self->{LINE}}; $i++) {
	    my $l = $self->{LINE}[$i];
	    if (! $l->{FOUND} && ($l->{COMMIT}->{SHA1} eq $commit->{SHA1})) {
		$l->{COMMIT} = $parent;
		$passed_blame_to_parent = 1;
	    }
	}
	$filename_in_parent = $filename;
    }
    elsif (@diff != 1) {
	# This should not happen.
	for (@diff) {
	    print "** @$_\n";
	}
	die "Oops";
    }
    else {
	my ($status, $sha1_1, $sha1_2, $file1, $file2) = @{$diff[0]};
	print STDERR "** $status $file1 $file2\n" if $::debug;
	if ($status =~ /N/ || $status =~ /M[0-9][0-9]/) {
	    # Either one of the other parents created it, or we did.
	    # At this point the only thing we know is that this
	    # parent is not responsible for it.
	    ;
	}
	else {
	    my $patch = Git::Patch->new($sha1_1, $sha1_2);
	    $filename_in_parent = $file1;
	    for (my $i = 0; $i < @{$self->{LINE}}; $i++) {
		my $l = $self->{LINE}[$i];
		if (! $l->{FOUND} && $l->{COMMIT}->{SHA1} eq $commit->{SHA1}) {
		    # We are suspected to have introduced this line.
		    # Does it exist in the parent?
		    my $lineno = $l->{LINENO};
		    my $parent_line = $patch->find_parent_line($lineno);
		    if ($parent_line < 0) {
			# No, we may be the guilty ones, or some other
			# parent might be.  We do not assign blame to
			# ourselves here yet.
			;
		    }
		    else {
			# This line is coming from the parent, so pass
			# blame to it.
			$l->{COMMIT} = $parent;
			$l->{FILENAME} = $file1;
			$l->{LINENO} = $parent_line;
			$passed_blame_to_parent = 1;
		    }
		}
	    }
	}
    }
    if ($passed_blame_to_parent && $self->{UNKNOWN}) {
	unshift @{$self->{WORK}},
	[$parent, $filename_in_parent];
    }
}

sub assign {
    my ($self, $commit, $filename) = @_;
    # We do read-tree of the current commit and diff-cache
    # with each parent, instead of running diff-tree.  This
    # is because diff-tree does not look for copies hard enough.

    if (exists $self->{'PATHINFO'} && exists $self->{'PATHINFO'}{$filename} &&
	!exists $self->{'PATHINFO'}{$filename}{$commit->{SHA1}} &&
	@{$commit->{PARENT}} == 1) {
	# This commit did not touch the path at all, and
	# has only one parent.  It is all that parent's fault.

	my $parent = Git::Commit->new($commit->{PARENT}[0]);
	my $passed_blame_to_parent = 0;
	for (my $i = 0; $i < @{$self->{LINE}}; $i++) {
	    my $l = $self->{LINE}[$i];
	    if (! $l->{FOUND} &&
		($l->{COMMIT}->{SHA1} eq $commit->{SHA1})) {
		$l->{COMMIT} = $parent;
		$passed_blame_to_parent = 1;
	    }
	}
	if ($passed_blame_to_parent && $self->{UNKNOWN}) {
	    unshift @{$self->{WORK}},
	    [$parent, $filename];
	}
	return;
    }

    print STDERR "* read-tree  $commit->{SHA1}\n" if $::debug;
    system('git-read-tree', '-m', $commit->{SHA1}) == 0
	or die "git-read-tree failed for $commit->{SHA1}";
    for my $parent (@{$commit->{PARENT}}) {
	$self->blame_parent($commit, Git::Commit->new($parent), $filename);
    }
    $self->take_responsibility($commit);
}

sub assign_blame {
    my ($self) = @_;
    while ($self->{UNKNOWN} && @{$self->{WORK}}) {
	my $wk = shift @{$self->{WORK}};
	my ($commit, $filename) = @$wk;
	$self->assign($commit, $filename);
    }
}



################################################################
package main;
my $usage = "blame [-z] <commit> filename";
my $line_termination = "\n";

$::ENV{GIT_INDEX_FILE} = "/tmp/blame-$$-index";
unlink($::ENV{GIT_INDEX_FILE});

if (@ARGV && $ARGV[0] eq '-z') {
    $line_termination = "\0";
    shift;
}

if (@ARGV != 2) {
    die $usage;
}
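
# Example invocation (commit name and path are hypothetical):
#	./blame HEAD Documentation/git.txt
# The -z form emits NUL-terminated records for machine consumption:
#	./blame -z HEAD Documentation/git.txt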

my $head_commit = Git::Commit->new($ARGV[0]);
my $filename = $ARGV[1];
my $blame = Git::Blame->new($head_commit, $filename);
if (-f ".blame-cache") {
    $blame->read_blame_cache(".blame-cache");
}

$blame->assign_blame();
$blame->print($line_termination);

unlink($::ENV{GIT_INDEX_FILE});

__END__

How does this work, and what do we do about merges?

The algorithm considers the first parent to be our main line of
development and treats it somewhat specially, compared to the other
parents.  So we pass the blame on to the first parent for any line
that has not changed from it.  A line that has changed from the first
parent we must have either inherited from some other parent, or it
could have been a merge conflict resolution edit we made on our own.
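
In code, one pass of this per-parent blame assignment is essentially
the inner loop of blame_parent() in the script above; here is a
condensed sketch of it, with the rename bookkeeping and error
handling stripped out:

    for my $l (@{$self->{LINE}}) {
        # Only lines this commit is still suspected of matter here.
        next if ($l->{FOUND} || $l->{COMMIT}->{SHA1} ne $commit->{SHA1});
        my $parent_line = $patch->find_parent_line($l->{LINENO});
        if (0 <= $parent_line) {
            # The parent already had this line; pass the blame on.
            $l->{COMMIT} = $parent;
            $l->{LINENO} = $parent_line;
        }
        # A negative answer means the hunk containing the line was
        # changed by this commit; leave it suspected for now, and
        # take_responsibility() will claim the leftovers at the end.
    }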

The following picture illustrates how we pass on and assign blames.

In the sample, the original O was forked into A and B and then merged
into M.  Lines 1, 2, and 4 did not change.  Lines 3 and 5 are changed
in A, and lines 5 and 6 are changed in B.  M made its own decision to
resolve the merge conflict at line 5 to something different from both
A and B:

                A: 1 2 T 4 T 6
               /               \ 
O: 1 2 3 4 5 6                  M: 1 2 T 4 M S
               \               / 
                B: 1 2 3 4 S S

In the following picture, each line is annotated with a blame letter.
A lowercase blame (e.g. "a" for "1") means that commit or one of its
ancestors is the guilty party, but we do not yet know which particular
commit is responsible for the change.  An uppercase blame means that
we know that commit itself is the guilty party.

First we look at M (the HEAD) and initialize Git::Blame->{LINE} like
this:

             M: 1 2 T 4 M S
                m m m m m m

That is, we know every line is the result of a modification made by M
or some ancestor of M, so we assign lowercase 'm' to all of them.

Then we examine our first parent A.  Throughout the algorithm, we are
only ever interested in the lines we are the suspect for, but this
being the initial round, we are the suspect for all of them.  We
notice that 1 2 T 4 are the same as in the parent A, so we pass the
blame for these four lines to A.  M and S are different from A, so we
leave them as they are (note that we do not immediately take the
blame for them):

             M: 1 2 T 4 M S
                a a a a m m

Next we go on to examine parent B.  Again, we are only interested in
the lines we are still the suspect for (i.e. M and S).  We notice S is
something we inherited from B, so we pass the blame on to it, like
this:

             M: 1 2 T 4 M S
                a a a a m b

Once we have exhausted the parents, we look at the result and take
responsibility for the remaining lines we are still the suspect for:

             M: 1 2 T 4 M S
                a a a a M b

We are done with M.  And we know commits A and B need to be examined
further, so we do them recursively.  When we look at A, we again only
look at the lines that A is the suspect for:

             A: 1 2 T 4 T 6
                a a a a M b

Among 1 2 T 4, comparing against its parent O, we notice 1 2 4 are
the same, so we pass the blame for those lines to O:

             A: 1 2 T 4 T 6
                o o a o M b

A is a non-merge commit; we have already exhausted its parents, so we
take responsibility for the remaining lines that A is the suspect for:

             A: 1 2 T 4 T 6
                o o A o M b

We go on like this, and the final result becomes:

             O: 1 2 3 4 5 6
                O O A O M B
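
(Mapping this back to the script: each of these fully assigned lines
comes out of Git::Blame::print as a record of the form

    :<commit sha1>	<author>	<committer>	<lineno>	<filename>

where a '?' in place of the leading ':' marks a line whose blame was
never settled.)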

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Last mile for 1.0 again
  2005-07-04 23:06                                   ` Junio C Hamano
@ 2005-07-05  1:54                                     ` Daniel Barkalow
  2005-07-05  6:24                                       ` Junio C Hamano
  0 siblings, 1 reply; 39+ messages in thread
From: Daniel Barkalow @ 2005-07-05  1:54 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Linus Torvalds, git

On Mon, 4 Jul 2005, Junio C Hamano wrote:

> >>>>> "DB" == Daniel Barkalow <barkalow@iabervon.org> writes:
> 
> DB> Junio: what's missing from annotate/blame?
> 
> Which one are you talking about?
> 
> What I use to generate http://members.cox.net/junkio/Summary.txt
> is an implementation of an algorithm I consider "complete" in
> that it does rename/copy and complete rewrite correctly.  What
> is missing from the implementation is efficiency.

[perl script]

> How does this work, and what do we do about merges?

I've got that part, but I'm not clear on how the rename/copy and complete
rewrite stuff works.

	-Daniel
*This .sig left intentionally blank*

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Last mile for 1.0 again
  2005-07-05  1:54                                     ` Daniel Barkalow
@ 2005-07-05  6:24                                       ` Junio C Hamano
  2005-07-05 13:34                                         ` Marco Costalba
  0 siblings, 1 reply; 39+ messages in thread
From: Junio C Hamano @ 2005-07-05  6:24 UTC (permalink / raw)
  To: Daniel Barkalow; +Cc: Linus Torvalds, git

>>>>> "DB" == Daniel Barkalow <barkalow@iabervon.org> writes:

DB> [perl script]

>> How does this work, and what do we do about merges?

DB> I've got that part, but I'm not clear on how the rename/copy and complete
DB> rewrite stuff works.

Rename/copy is simply about what files to use when comparing
between two trees.  If you are tracking file F and find that a
commit created that file by renaming what was originally G, then
you compare old G and this F and assign blame for common lines
to the ancestor of that commit (that had those lines in G), and
take responsibility for the rest yourself (or assign blame for
them to other parents).
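
In terms of the script posted earlier, this just means taking the blob
pair and the parent's path from the rename record when comparing; a
minimal sketch, assuming read_diff_raw() reports renames and copies
with an R/C status letter the way git-diff-cache -M -C does ($file1
being the path in the parent, $file2 the path in this commit):

    my ($status, $sha1_1, $sha1_2, $file1, $file2) = @{$diff[0]};
    if ($status =~ /^[RC]/) {
        # Compare old G ($file1) with this F ($file2) as usual; lines
        # common to both are blamed on the parent under G's name.
        my $patch = Git::Patch->new($sha1_1, $sha1_2);
        $filename_in_parent = $file1;
    }

(The else branch of blame_parent() already behaves this way, since it
takes $filename_in_parent from $file1 of the diff record.)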

Rewrite is easier.  If diff-tree -B says it is a complete rewrite,
then instead of assigning blame for the lines you are currently
suspected of to any of your parents, you take the blame for all of
them yourself, without comparing the file with the parent to assign
blame line-by-line.
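
A sketch of how that could look in blame_parent() (with -B, the raw
diff status carries the rewrite score, which is what the existing
M[0-9][0-9] test catches):

    if ($status =~ /M[0-9][0-9]/) {
        # Complete rewrite: pass no blame to this parent at all.
        # take_responsibility() will later claim every line we are
        # still suspected of for this commit.
        return;
    }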

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Last mile for 1.0 again
  2005-07-05  6:24                                       ` Junio C Hamano
@ 2005-07-05 13:34                                         ` Marco Costalba
  0 siblings, 0 replies; 39+ messages in thread
From: Marco Costalba @ 2005-07-05 13:34 UTC (permalink / raw)
  To: Junio C Hamano, Daniel Barkalow; +Cc: Linus Torvalds, git



--- Junio C Hamano <junkio@cox.net> wrote:

> >>>>> "DB" == Daniel Barkalow <barkalow@iabervon.org> writes:
> 
> DB> [perl script]
> 
> >> How does this work, and what do we do about merges?
> 

Checking diffs of all the parents can be computationally expensive.

I'm developing a different algorithm for qgit, where I know, before
running annotate, all the revs that changed a given file.

The revs are saved in the shaHist list according to git-rev-list --merge-order.

The algorithm operates from the oldest to the newest rev, calculating
the annotation for each following rev.

This algorithm takes advantage of the fact that most revs belong to 2 categories:

1) revs of the same development sequence (i.e. connected without merges)

2) empty merges, i.e. merges where the file we are interested in
 is listed in only one merging branch

If a rev in shaHist does not belong to the above categories, a reachability
analysis must be performed in order to find the direct ancestor among all
the previous revs in the list.


So, starting from shaHist, a ReachList rl is populated by getReachability():


void Annotate::getReachability(ReachList& rl, const QString& file, const QStringList& shaHist) {

	rl.clear();
	QStringList::const_iterator it = shaHist.begin();
	Rev* r = revLookup(*it);
	rl.append(ReachInfo(*it, r->id, INITIAL));
	for (it++; it != shaHist.end(); it++) {
		
		Rev* r = revLookup(*it);

		// initial revision case
		if (r->parents.count() == 0) {
			rl.append(ReachInfo(*it, r->id, INITIAL));
			continue;
		}
		// merge case
		if (r->parents.count() > 1) { 

			// check empty merges diffing against previous in list
			QString prevSha = rl.last().sha;
			QString d = getFileDiff(*it, prevSha, file);

			if (d.isEmpty()) { 
				rl.append(ReachInfo(*it, r->id, EMPTY_MERGE));
				rl.last().roots.append(prevSha);
			} else {
				// real merge: we need reach test.
				QStringList roots;
				roots.clear();
				for (uint i = 0; i < r->parents.count(); i++) {

					// here tree walking is performed
					QString root = getRoot(r->parents[i], rl);				
					if (!root.isEmpty())
						roots.append(root);
				}
				rl.append(ReachInfo(*it, r->id, MERGE));
				rl.last().roots = roots;
			}
			continue;
		}
		// normal case: a revision with a single parent
		if (r->id == rl.last().id) { // same development sequence

			QString prevSha = rl.last().sha;
			rl.append(ReachInfo(*it, r->id, SAME_BR)); // same branch
			rl.last().roots.append(prevSha);

		} else { // different development sequence

			QString root = getRoot(*it, rl); // <-- here tree walking is performed
			if (!root.isEmpty()) {
				rl.append(ReachInfo(*it, r->id, DIFF_BR));
				rl.last().roots.append(root);
			} else
				qDebug("ASSERT getReachability: rev %s not reachable",
					(*it).latin1());
		}
	}
}


The ReachList rl is then used to compute the correct annotations:


void Annotate::run(const QString& file, const QStringList& shaHist, AnnotateHistory& ah) {

	QString d, diffTarget;
	QStringList t;
	int pos;
	ReachList rl;
	ah.clear();

	getReachability(rl, file, shaHist);  // <-- here ReachList rl is calculated

	for (uint i = 0; i < rl.count(); i++) {

		switch(rl[i].type) { // <-- here ReachList rl is used to find annotations

		case SAME_BR:
		case DIFF_BR:
			diffTarget = rl[i].roots.first();
			d = getFileDiff(rl[i].sha, diffTarget, file);
			pos = shaHist.findIndex(diffTarget);
			ah.append(processDiff(d, ah[pos], getAuthor(rl[i].sha, shaHist)));
			break;
		case INITIAL:
			pos = shaHist.findIndex(rl[i].sha);
			ah.append(getFirstAnnotation(file, shaHist, pos));
			break;
		case EMPTY_MERGE:
			diffTarget = rl[i].roots.first();
			pos = shaHist.findIndex(diffTarget);
			t = ah[pos]; // copy annotation from previous
			ah.append(t);
			break;
		case MERGE:
			// append annotation calculated on first root
			diffTarget = rl[i].roots.first();
			d = getFileDiff(rl[i].sha, diffTarget, file);
			pos = shaHist.findIndex(diffTarget);
			ah.append(processDiff(d, ah[pos], "Merge"));

			// update annotation with all the others roots
			for (uint j = 1; j < rl[i].roots.count(); j++) {

				diffTarget = rl[i].roots[j];
				d = getFileDiff(rl[i].sha, diffTarget, file);
				pos = shaHist.findIndex(diffTarget);

				// create an annotation calculated between root[i] and
				// the merge.
				QStringList scndAnn = processDiff(d, ah[pos],
						 getAuthor(diffTarget, shaHist));

				// finally we unify the two annotations, so we catch
				// also the case of a 3-way merge
				unify(ah.last(), scndAnn);
			}
			break;
		}
	}
}


The advantage of this algorithm is that reachability analysis, i.e. tree
walking, is statistically done only for a small subset of revs.

Another side benefit is that the correct annotation is calculated for each
rev, and not only for the newest one.

Of course there are issues:

1) An ordered list of revs (file history) must be known in advance. This is not a problem
   for qgit because the file history is calculated while loading revs.

2) It does not take file renames into account.


Hope these notes can help in the ongoing discussion.

Marco



^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2005-07-05 13:35 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-06-25  4:20 kernel.org and GIT tree rebuilding David S. Miller
2005-06-25  4:40 ` Jeff Garzik
2005-06-25  5:23   ` Linus Torvalds
2005-06-25  5:48     ` Jeff Garzik
2005-06-25  6:16       ` Linus Torvalds
2005-06-26 16:41         ` Linus Torvalds
2005-06-26 18:39           ` Junio C Hamano
2005-06-26 19:19             ` Linus Torvalds
2005-06-26 19:45               ` Junio C Hamano
     [not found]                 ` <7v1x6om6o5.fsf@assigned-by-dhcp.cox.net>
     [not found]                   ` <Pine.LNX.4.58.0506271227160.19755@ppc970.osdl.org>
     [not found]                     ` <7v64vzyqyw.fsf_-_@assigned-by-dhcp.cox.net>
2005-06-28  6:56                       ` [PATCH] Obtain sha1_file_info() for deltified pack entry properly Junio C Hamano
2005-06-28  6:58                         ` Junio C Hamano
2005-06-28  6:58                         ` [PATCH 2/3] git-cat-file: use sha1_object_info() on '-t' Junio C Hamano
2005-06-28  6:59                         ` [PATCH 3/3] git-cat-file: '-s' to find out object size Junio C Hamano
2005-06-26 20:52           ` kernel.org and GIT tree rebuilding Chris Mason
2005-06-26 21:03             ` Chris Mason
2005-06-26 21:40             ` Linus Torvalds
2005-06-26 22:34               ` Linus Torvalds
2005-06-28 18:06           ` Nicolas Pitre
2005-06-28 19:28             ` Linus Torvalds
2005-06-28 21:08               ` Nicolas Pitre
2005-06-28 21:27                 ` Linus Torvalds
2005-06-28 21:55                   ` [PATCH] Bugfix: initialize pack_base to NULL Junio C Hamano
2005-06-29  3:55                   ` kernel.org and GIT tree rebuilding Nicolas Pitre
2005-06-29  5:16                     ` Nicolas Pitre
2005-06-29  5:43                       ` Linus Torvalds
2005-06-29  5:54                         ` Linus Torvalds
2005-06-29  7:16                           ` Last mile for 1.0 again Junio C Hamano
2005-06-29  9:51                             ` [PATCH] Add git-verify-pack command Junio C Hamano
2005-06-29 16:15                               ` Linus Torvalds
2005-07-04 21:40                             ` Last mile for 1.0 again Daniel Barkalow
2005-07-04 21:45                               ` Junio C Hamano
2005-07-04 21:59                               ` Linus Torvalds
2005-07-04 22:41                                 ` Daniel Barkalow
2005-07-04 23:06                                   ` Junio C Hamano
2005-07-05  1:54                                     ` Daniel Barkalow
2005-07-05  6:24                                       ` Junio C Hamano
2005-07-05 13:34                                         ` Marco Costalba
2005-06-25  5:04 ` kernel.org and GIT tree rebuilding Junio C Hamano
  -- strict thread matches above, loose matches on Subject: below --
2005-07-03  2:51 linux

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).