* kernel.org and GIT tree rebuilding
@ 2005-06-25 4:20 David S. Miller
2005-06-25 4:40 ` Jeff Garzik
2005-06-25 5:04 ` kernel.org and GIT tree rebuilding Junio C Hamano
0 siblings, 2 replies; 39+ messages in thread
From: David S. Miller @ 2005-06-25 4:20 UTC (permalink / raw)
To: git
To get a clean history to push to Linus, I typically blow
away my trees and make fresh ones into which I stick the
patches I want to merge.
That mostly works fine here on my local systems, but I know this
brings the kernel.org mirroring system to its knees. So what is the
generally condoned way to do stuff like this in a more friendly way?
Should I:
1) Do a git pull from Linus's tree once he takes my changes, then
ask GIT to prune the tree? How do I do that and how does it work?
2) Should I use .git/object/ database symlinking?
Are there any scripts out there which do this automatically?
Something as simple to run as "git-pull-script" and it takes
care of using links when possible on a local filesystem.
It sometimes takes an hour for my tree updates on master.kernel.org
to propagate to rsync.kernel.org so I can ask Linus to pull.
That's crazy.
* Re: kernel.org and GIT tree rebuilding
2005-06-25 4:20 kernel.org and GIT tree rebuilding David S. Miller
@ 2005-06-25 4:40 ` Jeff Garzik
2005-06-25 5:23 ` Linus Torvalds
2005-06-25 5:04 ` kernel.org and GIT tree rebuilding Junio C Hamano
1 sibling, 1 reply; 39+ messages in thread
From: Jeff Garzik @ 2005-06-25 4:40 UTC (permalink / raw)
To: David S. Miller; +Cc: git
David S. Miller wrote:
> To get a clean history to push to Linus, I typically blow
> away my trees and make fresh ones into which I stick the
> patches I want to merge.
>
> That mostly works fine here on my local systems, but I know this
> brings the kernel.org mirroring system to its knees. So what is the
> generally condoned way to do stuff like this in a more friendly way?
>
> Should I:
>
> 1) Do a git pull from Linus's tree once he takes my changes, then
> ask GIT to prune the tree? How do I do that and how does it work?
> It sometimes takes an hour for my tree updates on master.kernel.org
> to propagate to rsync.kernel.org so I can ask Linus to pull.
> That's crazy.
Unfortunately you cannot fix this by changing your actions. This is the
cumulative effect of all the git kernel trees on kernel.org. It now
takes over an hour for my non-git changes to propagate from master to
the mirrors, as well.
This is all due to the rsync sweeps, which have to scan metric tons of
inodes and dentries. Orders of magnitude over the pre-git days.
The ftpadmin@kernel.org folks are supposedly working on an
inotify-based system and an improved rsync application. No ETA or
details.
As an aside, with a cold cache git really punishes my disks. Ted Ts'o
noted that it really drains laptop batteries, too.
> 2) Should I use .git/object/ database symlinking?
>
> Are there any scripts out there which do this automatically?
> Something as simple to run as "git-pull-script" and it takes
> care of using links when possible on a local filesystem.
On both kernel.org and locally, I use 'cp -al' to duplicate the initial
.git/objects directory, and then rsync (->kernel.org) or git-pull-script
(<-kernel.org) to update it after that.
That definitely helps.
Maybe somebody needs to script a relink cron job for kernel.org?
Jeff
* Re: kernel.org and GIT tree rebuilding
2005-06-25 4:20 kernel.org and GIT tree rebuilding David S. Miller
2005-06-25 4:40 ` Jeff Garzik
@ 2005-06-25 5:04 ` Junio C Hamano
1 sibling, 0 replies; 39+ messages in thread
From: Junio C Hamano @ 2005-06-25 5:04 UTC (permalink / raw)
To: David S. Miller; +Cc: git
>>>>> "DSM" == David S Miller <davem@davemloft.net> writes:
DSM> To get a clean history to push to Linus, I typically blow
DSM> away my trees and make fresh ones into which I stick the
DSM> patches I want to merge.
DSM> Should I:
DSM> 1) Do a git pull from Linus's tree once he takes my changes, then
DSM> ask GIT to prune the tree? How do I do that and how does it work?
DSM> 2) Should I use .git/object/ database symlinking?
DSM> Are there any scripts out there which do this automatically?
DSM> Something as simple to run as "git-pull-script" and it takes
DSM> care of using links when possible on a local filesystem.
git-pull-script internally uses git-fetch-script, which knows how
to fetch from a local tree using hardlinks. Presumably, the following
workflow would work:
(1) You hack away in your private tree, while you keep a "to be
published" clean tree, both on your local machine.
(2) Do a GIT pull and merge in your private tree, to come up
with a clean set of changes. This is the tree you "typically
blow away". Reordering the commits to come up with a clean
history since you last pulled from Linus would also happen
in this tree.
(3) Once you have a commit that you want to publish (i.e. the
commit chain between that commit and the point you last
pulled from Linus is the "clean history to push to Linus"),
you go to your "to be published" clean tree, and run
git-fetch-script to fetch the commit you want to publish
from your private tree. When you give an absolute path as
the "remote repo", git-fetch-script uses git-local-pull
with its linking behaviour; otherwise the rsync backend is
used and you end up with a polluted object database. This
way you copy only the clean stuff from your private tree.
Your HEAD in this tree should be set to the commit you
wanted to publish. Running git-prune would be nicer, but if
your history is truly clean it should not be necessary.
(4) Garbage collecting your private tree with git-prune is your
own business.
* Re: kernel.org and GIT tree rebuilding
2005-06-25 4:40 ` Jeff Garzik
@ 2005-06-25 5:23 ` Linus Torvalds
2005-06-25 5:48 ` Jeff Garzik
0 siblings, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2005-06-25 5:23 UTC (permalink / raw)
To: Jeff Garzik; +Cc: David S. Miller, git
On Sat, 25 Jun 2005, Jeff Garzik wrote:
>
> This is all due to the rsync sweeps, which have to scan metric tons of
> inodes and dentries. Orders of magnitude over the pre-git days.
Well, the real solution is to use a git-aware protocol, not rsync.
rsync is wonderful for prototyping, and I wanted to make the database
rsync'able for that reason, but it clearly doesn't scale.
I think I'll make a "pack"/"unpack" pair that just packs all the necessary
objects between two commits. Then you can basically sync the object file
by doing
git-pack OLD..NEW | ssh other-end git-unpack
and you'd basically be done. It looks pretty easy to do, too..
Linus
* Re: kernel.org and GIT tree rebuilding
2005-06-25 5:23 ` Linus Torvalds
@ 2005-06-25 5:48 ` Jeff Garzik
2005-06-25 6:16 ` Linus Torvalds
0 siblings, 1 reply; 39+ messages in thread
From: Jeff Garzik @ 2005-06-25 5:48 UTC (permalink / raw)
To: Linus Torvalds; +Cc: David S. Miller, git
Linus Torvalds wrote:
>
> On Sat, 25 Jun 2005, Jeff Garzik wrote:
>
>>This is all due to the rsync sweeps, which have to scan metric tons of
>>inodes and dentries. Orders of magnitude over the pre-git days.
>
>
> Well, the real solution is to use a git-aware protocol, not rsync.
>
> rsync is wonderful for prototyping, and I wanted to make the database
> rsync'able for that reason, but it clearly doesn't scale.
>
> I think I'll make a "pack"/"unpack" pair that just packs all the necessary
> objects between two commits. Then you can basically sync the object file
> by doing
>
> git-pack OLD..NEW | ssh other-end git-unpack
>
> and you'd basically be done. It looks pretty easy to do, too..
The problem is kernel.org mirroring, not individual pushes and pulls,
really.
Would git-pack be the best solution for mirroring a bunch of git trees?
Jeff
* Re: kernel.org and GIT tree rebuilding
2005-06-25 5:48 ` Jeff Garzik
@ 2005-06-25 6:16 ` Linus Torvalds
2005-06-26 16:41 ` Linus Torvalds
0 siblings, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2005-06-25 6:16 UTC (permalink / raw)
To: Jeff Garzik; +Cc: David S. Miller, git
On Sat, 25 Jun 2005, Jeff Garzik wrote:
>
> The problem is kernel.org mirroring, not individual pushes and pulls,
> really.
>
> Would git-pack be the best solution for mirroring a bunch of git trees?
No guarantees, but here's a rough plan:
- I just committed a fairly trivial change to add a "--objects" flag to
git-rev-list, which allows you to basically say "I want to see the
difference not just in commit ID's, but also trees and blobs"
What does that mean? It means that in a mirroring scenario, you can,
for each git tree, do:
(a) On the slave:
cat .git/refs/*/* | sort | uniq > slave-ref-list
(b) On the master:
cat .git/refs/*/* | sort | uniq > master-ref-list
(c) On the master:
cmp master-ref-list slave-ref-list && exit 1
list=$(cat master-ref-list)
for i in $(cat slave-ref-list)
do
	list="$list ^$i"
done
git-rev-list --objects $list
and now that "git-rev-list" will list every object that needs to be
copied from the master to the slave. No need to read huge directories
etc, you get the list computed for you.
yeah, it clearly needs some refining to be useful, but I think you can
kind of see how it would work.
Now, the secondary advantage of this is that once you don't use rsync as
the mirroring method, you can now change the filesystem object database
layout. In particular, the packing thing that Chris Mason was working on
at some point suddenly becomes a lot more viable.
(In fact, more than that. You can make a single packed blob for all
"historical" objects, and that also gives you an efficient archive format
- if you're not required to have the full filesystem layout, you could
have a much more efficient packing that you basically do once a week or
something, so that you only keep the last week in the regular "one file
per object" format).
Linus
* Re: kernel.org and GIT tree rebuilding
2005-06-25 6:16 ` Linus Torvalds
@ 2005-06-26 16:41 ` Linus Torvalds
2005-06-26 18:39 ` Junio C Hamano
` (2 more replies)
0 siblings, 3 replies; 39+ messages in thread
From: Linus Torvalds @ 2005-06-26 16:41 UTC (permalink / raw)
To: Jeff Garzik; +Cc: David S. Miller, Git Mailing List, Nicolas Pitre, Chris Mason
On Fri, 24 Jun 2005, Linus Torvalds wrote:
>
> yeah, it clearly needs some refining to be useful, but I think you can
> kind of see how it would work.
Ok, here's how it works.
- Pick a starting commit (or a hundred)
- Pick an ending commit (or a hundred)
- generate the list of objects in between them
git-rev-list --objects end ^start > object-list
- Pack that list of objects into an "object pack":
git-pack-objects out < object-list
(This actually generates two files: "out.idx" is the index file,
"out.pack" is the data file, but I'll make it concatenate the two at
some point)
- move the pack-files over somewhere else
- unpack them
git-unpack-objects out
and you're done.
Now, the reason I use "pack" and "unpack" instead of just "tar" to
transport the objects is that this allows me to do a fairly efficient
packing. I wanted these pack-files to be independent (ie they do _not_
depend on any objects outside of the pack-file), but within the objects
described in the pack I can do delta-compression.
Now, that doesn't much help for small updates (where the objects are just
unrelated and have no deltas), but it helps increasingly for big ones. The
biggest one obviously being the whole path from the start to the HEAD..
For example, the "du -sh .git/objects" for the git project itself is 17MB
for me, and I can do:
torvalds@ppc970:~/git> du -sh .git/objects
17M .git/objects
torvalds@ppc970:~/git> time git-rev-list --objects HEAD | git-pack-objects out
Packing 3656 objects
real 0m3.779s
user 0m3.169s
sys 0m0.602s
torvalds@ppc970:~/git> ls -lh out.*
-rw-rw-r-- 1 torvalds torvalds 87K Jun 26 09:12 out.idx
-rw-rw-r-- 1 torvalds torvalds 2.0M Jun 26 09:12 out.pack
ie it packs down to a nice 2MB pack-file with a small index. Move that
over to somewhere else, and unpack it, and you'll get all the regular
objects (it doesn't move tags and refs over, you'll have to do that
outside of the packing).
Now, you can trade off some packing time to get a better pack:
torvalds@ppc970:~/git> time git-rev-list --objects HEAD | git-pack-objects --window=100 out
Packing 3656 objects
real 0m11.953s
user 0m11.294s
sys 0m0.663s
torvalds@ppc970:~/git> ls -lh out.*
-rw-rw-r-- 1 torvalds torvalds 87K Jun 26 09:14 out.idx
-rw-rw-r-- 1 torvalds torvalds 1.6M Jun 26 09:14 out.pack
and if you want to allow deep delta chains (the default delta depth
limiting is 10), you can get even better results:
torvalds@ppc970:~/git> time git-rev-list --objects HEAD | git-pack-objects --window=100 --depth=100 out
Packing 3656 objects
real 0m12.374s
user 0m11.704s
sys 0m0.659s
torvalds@ppc970:~/git> ls -lh out.*
-rw-rw-r-- 1 torvalds torvalds 87K Jun 26 09:16 out.idx
-rw-rw-r-- 1 torvalds torvalds 1.3M Jun 26 09:16 out.pack
but then unpacking will be slightly heavier.
(Doing the same for the kernel is obviously much more expensive just
because the kernel is so much bigger. A big delta discovery window like
100 takes about fifteen minutes to pack on my machine, but gets the
current kernel archive down to 70MB or so. That's ok for a monthly "pack
all the objects" to keep size requirements down, but you clearly don't
want to do this all the time ;).
Now, perhaps the more interesting part is that I also designed the pack
format so that it should be a good "history" format, not just a way to
move objects from one place to the other. Ie if you worry about diskspace,
you can pack everything up to now into one big pack, and then remove
the original objects.
Don't do that yet, btw - I haven't actually written the code to read
stuff out of packs if we don't find it in the object directory, but
the layout is such that it should be straightforward and pretty
efficient (though a deep delta chain obviously _will_ cause a
performance hit there).
I actually like this approach better than having delta-objects in the
filesystem. Partly because the pack-file is self-contained, partly because
it also solves the fs blocking issue, yet is still efficient to look up
the results without having hardlinks etc to duplicate objects virtually.
And when you do the packing by hand as an "archival" mechanism, it also
doesn't have any of the downsides that Chris' packing approach had.
Nico? Chris? Interested in giving it a look? It's kind of a combination of
your things, generalized and then made to have fast lookup with the index.
Fast lookup doesn't matter for a normal unpack, of course, and if I just
always wanted to unpack all the objects (ie just an object transfer
mechanism) I'd have made the index be a toposort of the objects. But
because I wanted to be able to use it as an archival format, I needed it
to be "random-access" by object name. So the index is in fact a binary
tree (well, sorted array, so the lookup degenerates into a binary search)
with a top-level index splitting up the contents based on the first byte
(the same way the filesystem layout does).
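To make that concrete, here is a minimal lookup sketch under assumed
details (the function name and exact layout are hypothetical, not a
committed format): a 256-entry fan-out table indexed by the first
SHA1 byte, where entry b counts how many of the sorted 20-byte names
have a first byte <= b, followed by the sorted name array itself.

	#include <string.h>

	/* Returns the entry index, or -1 if the object is not in this
	 * pack.  fanout[] brackets the binary search to the names that
	 * share the first byte of sha1. */
	static int find_object(const unsigned int *fanout,  /* 256 entries */
			       const unsigned char *names,  /* sorted, 20 bytes each */
			       const unsigned char *sha1)
	{
		unsigned int lo = sha1[0] ? fanout[sha1[0] - 1] : 0;
		unsigned int hi = fanout[sha1[0]];

		while (lo < hi) {
			unsigned int mi = lo + (hi - lo) / 2;
			int cmp = memcmp(names + 20 * mi, sha1, 20);
			if (!cmp)
				return mi;
			if (cmp < 0)
				lo = mi + 1;
			else
				hi = mi;
		}
		return -1;
	}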
Linus
* Re: kernel.org and GIT tree rebuilding
2005-06-26 16:41 ` Linus Torvalds
@ 2005-06-26 18:39 ` Junio C Hamano
2005-06-26 19:19 ` Linus Torvalds
2005-06-26 20:52 ` kernel.org and GIT tree rebuilding Chris Mason
2005-06-28 18:06 ` Nicolas Pitre
2 siblings, 1 reply; 39+ messages in thread
From: Junio C Hamano @ 2005-06-26 18:39 UTC (permalink / raw)
To: Linus Torvalds
Cc: David S. Miller, Git Mailing List, Nicolas Pitre, Chris Mason
>>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes:
LT> I actually like this approach better than having delta-objects in the
LT> filesystem. Partly because the pack-file is self-contained, partly because
LT> it also solves the fs blocking issue, yet is still efficient to look up
LT> the results without having hardlinks etc to duplicate objects virtually.
LT> And when you do the packing by hand as an "archival" mechanism, it also
LT> doesn't have any of the downsides that Chris' packing approach had.
After analyzing what is involved in integrating packed GIT
into read_sha1_file() [*1*], I agree 100% with the above. I
mean no disrespect to what Nico has done (and I myself have written
some code to work with Nico's deltified objects when I did diffs
and pull fixes), but it would help the code very much if we did
not have to worry about "delta" objects in GIT_OBJECT_DIRECTORY.
My preference is to do things in this order:
(0) concatenate pack and idx files;
(1) teach read_sha1_file() to read from packed GIT;
(2) teach fsck-cache about packed GIT;
(3) have people with deltified repositories convert them back
to undeltified (I think git-pack-objects would barf on such
a repository);
(4) drop "delta" objects from GIT_OBJECT_DIRECTORY; this means
that git-deltafy-script and git-mkdelta have to go.
(5) tell git-*-pull about packed GIT;
[Footnotes]
*1* Here is the analysis I did last night, still assuming that
we would support "delta" objects in GIT_OBJECT_DIRECTORY. The
"trickier" map_sha1_file() users almost all involve "delta"
objects, and that is why I prefer dropping them.
- Enhance the GIT_ALTERNATE_OBJECT_DIRECTORIES mechanism so that
each of its components can be either a directory or a packed file.
- sha1_file.c::find_sha1_file() has to be enhanced to return
not just a path (in the current "individual object file"
case) but also a pointer to a structure that describes a packed
file in the GIT_ALTERNATE_OBJECT_DIRECTORIES list, along with
the offset for the entry.
- The change necessary to sha1_file.c::has_sha1_file() is
minimal. find_sha1_file(), updated along the above lines,
would say whether the thing exists anyway, so it can easily
keep returning true/false as it currently does.
- sha1_file.c::read_sha1_file() would be the primary piece to
unpack from the packed representation.
- sha1_file.c::map_sha1_file() is trickier. It has a handful
of callers outside sha1_file.c, for valid reasons, so we will
need to audit the callers and have them fall back on
read_sha1_file() as appropriate. Here is the result of my
first pass:
- (easy) sha1_delta_base() is used only when an object is
deltified, and if so, to get at the base object. We can
just tell the caller our object is not deltified when it
resides in a packed file.
- (easy) sha1_file_size() is used by diffcore to measure the
expanded blob size. Although the implementation obviously
has to be different, it would be trivial to find the size
if the object resides in a packed file.
- (easy) pack-objects.c::check_object() uses map_sha1_file() so that
it can unpack just enough to get the type of the object. We
should be able to introduce a new interface (say,
sha1_file.c::sha1_object_type()) for doing this sort of
thing.
- (harder) mkdelta.c::get_buffer(), object.c::parse_object()
and delta.c::process_delta() are trickier, because they
want to treat "delta" as a raw object (otherwise we would
have just done read_sha1_file() instead of the
map/unpack_sha1_file pair).
- (harder) ssh-push.c::serve_object() also wants the raw
representation to ship directly to the other end.
* Re: kernel.org and GIT tree rebuilding
2005-06-26 18:39 ` Junio C Hamano
@ 2005-06-26 19:19 ` Linus Torvalds
2005-06-26 19:45 ` Junio C Hamano
0 siblings, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2005-06-26 19:19 UTC (permalink / raw)
To: Junio C Hamano
Cc: David S. Miller, Git Mailing List, Nicolas Pitre, Chris Mason
On Sun, 26 Jun 2005, Junio C Hamano wrote:
>
> My preference is to do things in this order:
>
> (0) concatenate pack and idx files;
Actually, I was originally planning to do that, but now that I have
thought about what read_sha1_file() would actually do, I think it's more
efficient to leave the index as a separate file.
In particular, what you'd normally do is that if you can't look up the
file in the regular object directory, you start going through the pack
files. You can do it by having GIT_ALTERNATE_OBJECT_DIRECTORIES point to a
pack file, but I actually would prefer the notion of just adding a
.git/objects/pack
subdirectory, and having object lookup just automatically open and map all
index files in that subdirectory.
And the thing is, you really just want to map the index files; the
data files can be so big that you can't afford to map them (ie a
really big project might have several pack-files of a gig each or
something like that).
And the most efficient way to map just the index file is to keep it
separate, because then the "stat()" will just get the information
directly, and you then just mmap that.
The alternative is to first read the index of the index (to figure out how
big the index is), and then map the rest. But that just seems a lot
messier than just mapping the index file directly.
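As a minimal sketch of that approach (the path handling and helper
name are illustrative assumptions, not the real code): stat() the
index file to learn its size, then map the whole thing read-only.

	#include <fcntl.h>
	#include <sys/mman.h>
	#include <sys/stat.h>
	#include <unistd.h>

	static void *map_index_file(const char *path, unsigned long *sizep)
	{
		struct stat st;
		void *map;
		int fd = open(path, O_RDONLY);

		if (fd < 0)
			return NULL;
		if (fstat(fd, &st) < 0) {
			close(fd);
			return NULL;
		}
		map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
		close(fd);	/* the mapping outlives the descriptor */
		if (map == MAP_FAILED)
			return NULL;
		*sizep = st.st_size;
		return map;
	}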
And when creating these things, we do need to create the data file
(which can be big enough that it doesn't fit in memory) first, so we
have to have a separate file for it; we can't just stream it out to
stdout.
Now, when _sending_ the pack-files, linearizing them is easy: you just
send the index first, and the data file immediately afterwards. The index
tells how big it is, so there's no need to even add any markers: you can
do something like 'git-send-script' with something simple like
git-rev-list ... | git-pack-file tmp-pack &&
cat tmp-pack.idx tmp-pack.data | ssh other git-receive-script
So let's just keep the index/data files separate.
Linus
* Re: kernel.org and GIT tree rebuilding
2005-06-26 19:19 ` Linus Torvalds
@ 2005-06-26 19:45 ` Junio C Hamano
[not found] ` <7v1x6om6o5.fsf@assigned-by-dhcp.cox.net>
0 siblings, 1 reply; 39+ messages in thread
From: Junio C Hamano @ 2005-06-26 19:45 UTC (permalink / raw)
To: Linus Torvalds
Cc: David S. Miller, Git Mailing List, Nicolas Pitre, Chris Mason
>>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes:
LT> On Sun, 26 Jun 2005, Junio C Hamano wrote:
>>
>> My preference is to do things in this order:
>>
>> (0) concatenate pack and idx files;
LT> So let's just keep the index/data files separate.
Fair enough. Having thought about it a bit more, if people
agree, I think it would make more sense to rip out "delta"
object support first, before doing read_sha1_file() and friends
that use .git/objects/pack.
My "preferred order" now looks like this:
(1) have people with deltified repositories convert them back
to undeltified (I think git-pack-objects would barf on such
repository);
(2) drop "delta" objects from GIT_OBJECT_DIRECTORY; this means
that git-deltafy-script and git-mkdelta have to go.
(3) teach read_sha1_file() to read from packed GIT in
.git/objects/pack;
(4) teach fsck-cache about packed GIT;
(5) tell git-*-pull about packed GIT;
* Re: kernel.org and GIT tree rebuilding
2005-06-26 16:41 ` Linus Torvalds
2005-06-26 18:39 ` Junio C Hamano
@ 2005-06-26 20:52 ` Chris Mason
2005-06-26 21:03 ` Chris Mason
2005-06-26 21:40 ` Linus Torvalds
2005-06-28 18:06 ` Nicolas Pitre
2 siblings, 2 replies; 39+ messages in thread
From: Chris Mason @ 2005-06-26 20:52 UTC (permalink / raw)
To: Linus Torvalds
Cc: Jeff Garzik, David S. Miller, Git Mailing List, Nicolas Pitre
On Sunday 26 June 2005 12:41, Linus Torvalds wrote:
> On Fri, 24 Jun 2005, Linus Torvalds wrote:
> > yeah, it clearly needs some refining to be useful, but I think you can
> > kind of see how it would work.
>
> Ok, here's how it works.
>
> - Pick a starting commit (or a hundred)
>
> - Pick an ending commit (or a hundred)
>
> - generate the list of objects in between them
>
> git-rev-list --objects end ^start > object-list
>
> - Pack that list of objects into an "object pack":
>
> git-pack-objects out < object-list
Without having read the code, the big thing that hurt performance in my early
packed file work was compressing the whole packed file instead of individual
sub-objects. It takes more room to compress each object, but when I
compressed the whole thing read performance was quite bad.
-chris
* Re: kernel.org and GIT tree rebuilding
2005-06-26 20:52 ` kernel.org and GIT tree rebuilding Chris Mason
@ 2005-06-26 21:03 ` Chris Mason
2005-06-26 21:40 ` Linus Torvalds
1 sibling, 0 replies; 39+ messages in thread
From: Chris Mason @ 2005-06-26 21:03 UTC (permalink / raw)
To: Linus Torvalds
Cc: Jeff Garzik, David S. Miller, Git Mailing List, Nicolas Pitre
On Sunday 26 June 2005 16:52, Chris Mason wrote:
> >
> > git-rev-list --objects end ^start > object-list
> >
> > - Pack that list of objects into an "object pack":
> >
> > git-pack-objects out < object-list
>
> Without having read the code, the big thing that hurt performance in my
> early packed file work was compressing the whole packed file instead of
> individual sub-objects. It takes more room to compress each object, but
> when I compressed the whole thing read performance was quite bad.
Sorry, fat-fingered the send key...
The hard links were the biggest problem with my packed file patches;
I think the dynamic lookup in a separate packed file index is the
best way to go.
-chris
* Re: kernel.org and GIT tree rebuilding
2005-06-26 20:52 ` kernel.org and GIT tree rebuilding Chris Mason
2005-06-26 21:03 ` Chris Mason
@ 2005-06-26 21:40 ` Linus Torvalds
2005-06-26 22:34 ` Linus Torvalds
1 sibling, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2005-06-26 21:40 UTC (permalink / raw)
To: Chris Mason; +Cc: Jeff Garzik, David S. Miller, Git Mailing List, Nicolas Pitre
On Sun, 26 Jun 2005, Chris Mason wrote:
>
> Without having read the code, the big thing that hurt performance in my early
> packed file work was compressing the whole packed file instead of individual
> sub-objects. It takes more room to compress each object, but when I
> compressed the whole thing read performance was quite bad.
Since I wanted random-access, compressing the whole thing just wasn't an
option.
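A toy demonstration of the trade-off, with made-up object contents
and plain zlib: deflate each object as its own stream at a recorded
offset, and any single object can then be inflated without touching
the rest of the file.

	#include <stdio.h>
	#include <string.h>
	#include <zlib.h>

	int main(void)
	{
		const char *obj[] = { "first object", "second object", "third object" };
		unsigned char pack[4096], out[64];
		unsigned long offset[3], clen[3], at = 0;
		uLongf n;
		int i;

		for (i = 0; i < 3; i++) {
			n = sizeof(pack) - at;
			offset[i] = at;
			/* one independent zlib stream per object */
			compress2(pack + at, &n, (const Bytef *)obj[i],
				  strlen(obj[i]) + 1, Z_BEST_COMPRESSION);
			clen[i] = n;
			at += n;
		}

		/* random access: inflate only the third object */
		n = sizeof(out);
		uncompress(out, &n, pack + offset[2], clen[2]);
		printf("%s\n", (char *)out);
		return 0;
	}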
Besides, the big space savings come from finding deltas, which is
obviously also a form of compression, just at a higher level. The
biggest problem there is guessing which objects to try to delta
against, and right now that part is pretty stupid and could possibly
be improved (it just sorts objects by size and tries to delta against
"close" objects).
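In code terms, that stupid sort plus window amounts to something like
the sketch below; struct entry and try_delta() are stand-ins, not the
real pack-objects code.

	#include <stdlib.h>

	#define WINDOW 10

	struct entry { unsigned long size; /* plus object data */ };

	static void try_delta(struct entry *trg, struct entry *src)
	{
		/* stub: compute a delta of trg against src and keep it
		 * only if it is smaller than the best one found so far */
		(void)trg; (void)src;
	}

	static int size_cmp(const void *a, const void *b)
	{
		unsigned long sa = ((const struct entry *)a)->size;
		unsigned long sb = ((const struct entry *)b)->size;
		return sa < sb ? -1 : sa > sb;
	}

	static void find_deltas(struct entry *obj, int nr)
	{
		int i, j;

		qsort(obj, nr, sizeof(*obj), size_cmp);
		for (i = 0; i < nr; i++)
			for (j = i - 1; j >= 0 && j > i - WINDOW; j--)
				try_delta(&obj[i], &obj[j]);
	}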
To generate a better sort _would_ actually be pretty close to doing a
global compression (it really does boil down to the same thing: finding
big sub-sequences, except it's in a "fragmented" space), but one issue is
that I don't want to read in the whole data set in one go, so it would
have to be based on some rolling hash or something. Davide pointed to
rzip, and a variation of that (which knows about object boundaries) might
work.
(You can also sort by filename, if you want to try. I don't track
filenames at all there and it's actually non-trivial to do, so that would
require some new and pretty nasty code, but it's possible in _theory_ at
least.)
Anyway, that's all potential improvement for generating better packing,
and it should certainly be possible without changing the format - just
generate a better initial sort, in order to find more deltas (or rather,
find them faster by using a smaller window size).
So the stupid sort I have now does actually work, but exactly because it's
so stupid it wants a big window for best packing (because there might be a
lot of objects that aren't interesting), which in turn is quite expensive.
So a better sort would make a smaller window more effective.
[ Some numbers: a window of 10 objects is the default, and packs the
current kernel down to 77MB in 2m21s. A window of 20 objects improves
that packing to 71MB, but makes the packing time go up to 3m36s for me.
And a window of 100 gets us down to 62M but takes 11m54s.
A window of 200 (with a delta depth of 200 too - likely _way_ too deep
for normal use) gives you a 59M pack, but takes 20m59s, so there's
definitely a point of diminishing returns.
This is all for the current HEAD, which takes up 264M the "traditional"
git way and takes 141M without any deltas, just packed tightly with no
filesystem blocking.
Now, as you may notice, that's actually a slightly sub-linear increase in
time, because as we find a delta, we will only accept smaller deltas in
the future, so we can often stop comparing even before we've reached the
maximum window size, and so effort is slightly less than linear because
there's effectively a constant component to part of it.
Also, the good news is that you probably don't want to generate one
humungous pack archive anyway, but you're likely better off doing a new
incremental pack every few months. So we'll never have the situation
that creating a pack gets increasingly more costly, since at some point
you just say "ok, I created a perfect pack for the first 4 months of
development, I'll now do subsequent packs on top of that instead".
The other good news is that a pack is also a natural boundary for fsck
(as in "ok, I found that object in a pack, so I won't bother going
deeper in the reachability chain"), so if you start packing your
repository, fsck will only have to worry about the objects that are
unpacked. That makes them work really naturally for archiving, ie this
all means that you can avoid a lot of overhead by packing your history
every once in a while, with it all being entirely transparent.
In other words, if you just pack every month, you can basically
guarantee that fsck costs etc never really go up, and your diskspace
also goes up only very slowly. The packed format is quite efficient in
many ways, but it is totally immutable (ie you can't add anything to an
archive - a pack stays the way it always was, and if you want to pack
more you have to either re-do the pack or just create a new one) ]
Linus
* Re: kernel.org and GIT tree rebuilding
2005-06-26 21:40 ` Linus Torvalds
@ 2005-06-26 22:34 ` Linus Torvalds
0 siblings, 0 replies; 39+ messages in thread
From: Linus Torvalds @ 2005-06-26 22:34 UTC (permalink / raw)
To: Chris Mason; +Cc: Jeff Garzik, David S. Miller, Git Mailing List, Nicolas Pitre
On Sun, 26 Jun 2005, Linus Torvalds wrote:
>
> (You can also sort by filename, if you want to try. I don't track
> filenames at all there and it's actually non-trivial to do, so that would
> require some new and pretty nasty code, but it's possible in _theory_ at
> least.)
Heh. It's actually very easy if you don't take the "name" too seriously,
and you just pick some random one, namely the first one that was used to
reach the entry.
Then, you might sort the objects on a hash based on the name, and get
tons of cheap deltas close-by.
It is _uglee_, but hey, it's a heuristic, and it happens to work pretty
well. It brought the kernel pack down to 59M even with just a small window
of 10.
Thanks to Davide for making me think about this hack, although he talked
about something much more proper (and harder) than this quick and
ugly heuristic ;)
The main change is that "git-rev-list --objects" has been changed to show
the name the object was reached through.
Linus
* [PATCH] Obtain sha1_file_info() for deltified pack entry properly.
[not found] ` <7v64vzyqyw.fsf_-_@assigned-by-dhcp.cox.net>
@ 2005-06-28 6:56 ` Junio C Hamano
2005-06-28 6:58 ` Junio C Hamano
` (2 more replies)
0 siblings, 3 replies; 39+ messages in thread
From: Junio C Hamano @ 2005-06-28 6:56 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git
I will be sending these three patches:
[PATCH 1/3] Obtain sha1_file_info() for deltified pack entry properly.
[PATCH 2/3] git-cat-file: use sha1_object_info() on '-t'.
[PATCH 3/3] git-cat-file: '-s' to find out object size.
The first one is slightly different from what I sent earlier to
you privately. If you have already applied that one, please apply
the 4-liner alternate patch attached to this message on top of it,
to get the fix that is included in the one in this series (and drop
the first one, obviously).
The second and third patches fell out as a bonus while I was
debugging sha1_file_info(). The third one especially is in the
"because we can do it so cheaply now" category, not the "because I
need to have that feature" one, and I do not mind too much if you
drop it, but I suspect somebody may find it useful.
The "4-liner alternate patch" follows.
------------
Add missing use_packed_git() call.
The function sha1_object_info() was using the packed GIT file
without making sure it was mapped, which resulted in
segfaults.
We would need to introduce unuse_packed_git() call and do proper
use counting to figure out when it is safe to unmap, but
currently we do not unmap packed file yet.
Signed-off-by: Junio C Hamano <junkio@cox.net>
---
sha1_file.c | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)
diff --git a/sha1_file.c b/sha1_file.c
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -675,6 +675,10 @@ static int packed_object_info(struct pac
offset = entry->offset;
if (p->pack_size - 5 < offset)
die("object offset outside of pack file");
+
+ if (use_packed_git(p))
+ die("cannot map packed file");
+
pack = p->pack_base + offset;
size = (pack[1] << 24) + (pack[2] << 16) + (pack[3] << 8) + pack[4];
left = p->pack_size - offset - 5;
------------------------------------------------
* Re: [PATCH] Obtain sha1_file_info() for deltified pack entry properly.
2005-06-28 6:56 ` [PATCH] Obtain sha1_file_info() for deltified pack entry properly Junio C Hamano
@ 2005-06-28 6:58 ` Junio C Hamano
2005-06-28 6:58 ` [PATCH 2/3] git-cat-file: use sha1_object_info() on '-t' Junio C Hamano
2005-06-28 6:59 ` [PATCH 3/3] git-cat-file: '-s' to find out object size Junio C Hamano
2 siblings, 0 replies; 39+ messages in thread
From: Junio C Hamano @ 2005-06-28 6:58 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git
[PATCH 1/3] Obtain sha1_file_info() for deltified pack entry properly.
The initial one was not doing enough to figure things out
without uncompressing too much. This version also fixes a potential
segfault resulting from a missing use_packed_git() call.
We would need to introduce unuse_packed_git() call and do proper
use counting to figure out when it is safe to unmap, but
currently we do not unmap packed file yet.
Signed-off-by: Junio C Hamano <junkio@cox.net>
---
sha1_file.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
1 files changed, 69 insertions(+), 4 deletions(-)
d2f58b4aef500835489f30ac5df7985bc21e3c24
diff --git a/sha1_file.c b/sha1_file.c
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -601,9 +601,70 @@ void * unpack_sha1_file(void *map, unsig
return unpack_sha1_rest(&stream, hdr, *size);
}
-/* Returns 0 on fast-path success, returns 1 on deltified
- * and need to unpack to see info.
- */
+static int packed_delta_info(unsigned char *base_sha1,
+ unsigned long delta_size,
+ unsigned long left,
+ char *type,
+ unsigned long *sizep)
+{
+ unsigned char *data;
+ unsigned char delta_head[64];
+ int i;
+ unsigned char cmd;
+ unsigned long data_size, result_size, base_size, verify_base_size;
+ z_stream stream;
+ int st;
+
+ if (left < 20)
+ die("truncated pack file");
+ if (sha1_object_info(base_sha1, type, &base_size))
+ die("cannot get info for delta-pack base");
+
+ data = base_sha1 + 20;
+ data_size = left - 20;
+
+ memset(&stream, 0, sizeof(stream));
+
+ stream.next_in = data;
+ stream.avail_in = data_size;
+ stream.next_out = delta_head;
+ stream.avail_out = sizeof(delta_head);
+
+ inflateInit(&stream);
+ st = inflate(&stream, Z_FINISH);
+ inflateEnd(&stream);
+ if ((st != Z_STREAM_END) && stream.total_out != sizeof(delta_head))
+ die("delta data unpack-initial failed");
+
+ /* Examine the initial part of the delta to figure out
+ * the result size. Verify the base size while we are at it.
+ */
+ data = delta_head;
+ verify_base_size = i = 0;
+ cmd = *data++;
+ while (cmd) {
+ if (cmd & 1)
+ verify_base_size |= *data++ << i;
+ i += 8;
+ cmd >>= 1;
+ }
+
+ /* Read the result size */
+ result_size = i = 0;
+ cmd = *data++;
+ while (cmd) {
+ if (cmd & 1)
+ result_size |= *data++ << i;
+ i += 8;
+ cmd >>= 1;
+ }
+ if (verify_base_size != base_size)
+ die("delta base size mismatch");
+
+ *sizep = result_size;
+ return 0;
+}
+
static int packed_object_info(struct pack_entry *entry,
char *type, unsigned long *sizep)
{
@@ -614,12 +675,16 @@ static int packed_object_info(struct pac
offset = entry->offset;
if (p->pack_size - 5 < offset)
die("object offset outside of pack file");
+
+ if (use_packed_git(p))
+ die("cannot map packed file");
+
pack = p->pack_base + offset;
size = (pack[1] << 24) + (pack[2] << 16) + (pack[3] << 8) + pack[4];
left = p->pack_size - offset - 5;
switch (*pack) {
case 'D':
- return 1;
+ return packed_delta_info(pack+5, size, left, type, sizep);
break;
case 'C':
strcpy(type, "commit");
------------
* [PATCH 2/3] git-cat-file: use sha1_object_info() on '-t'.
2005-06-28 6:56 ` [PATCH] Obtain sha1_file_info() for deltified pack entry properly Junio C Hamano
2005-06-28 6:58 ` Junio C Hamano
@ 2005-06-28 6:58 ` Junio C Hamano
2005-06-28 6:59 ` [PATCH 3/3] git-cat-file: '-s' to find out object size Junio C Hamano
2 siblings, 0 replies; 39+ messages in thread
From: Junio C Hamano @ 2005-06-28 6:58 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git
When trying to find out the type of the object, there is no need
to uncompress the whole object. Just use sha1_object_info().
Signed-off-by: Junio C Hamano <junkio@cox.net>
---
cat-file.c | 10 ++++------
1 files changed, 4 insertions(+), 6 deletions(-)
fcfdd3d74d359af32dd30aa9f4fe373cebb3e232
diff --git a/cat-file.c b/cat-file.c
--- a/cat-file.c
+++ b/cat-file.c
@@ -16,13 +16,11 @@ int main(int argc, char **argv)
usage("git-cat-file [-t | tagname] <sha1>");
if (!strcmp("-t", argv[1])) {
- buf = read_sha1_file(sha1, type, &size);
- if (buf) {
- buf = type;
- size = strlen(type);
- type[size] = '\n';
- size++;
+ if (!sha1_object_info(sha1, type, &size)) {
+ printf("%s\n", type);
+ return 0;
}
+ buf = NULL;
} else {
buf = read_object_with_reference(sha1, argv[1], &size, NULL);
}
------------
* [PATCH 3/3] git-cat-file: '-s' to find out object size.
2005-06-28 6:56 ` [PATCH] Obtain sha1_file_info() for deltified pack entry properly Junio C Hamano
2005-06-28 6:58 ` Junio C Hamano
2005-06-28 6:58 ` [PATCH 2/3] git-cat-file: use sha1_object_info() on '-t' Junio C Hamano
@ 2005-06-28 6:59 ` Junio C Hamano
2 siblings, 0 replies; 39+ messages in thread
From: Junio C Hamano @ 2005-06-28 6:59 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git
We use sha1_object_info() now, and getting the size is also trivial.
I admit that this is more of "because we can" than "because I see
an immediate need for it", though.
Signed-off-by: Junio C Hamano <junkio@cox.net>
---
Documentation/git-cat-file.txt | 12 +++++++++---
cat-file.c | 13 ++++++++++---
2 files changed, 19 insertions(+), 6 deletions(-)
dcf2fc7609509985d4411fb13c6956bcf3c8560f
diff --git a/Documentation/git-cat-file.txt b/Documentation/git-cat-file.txt
--- a/Documentation/git-cat-file.txt
+++ b/Documentation/git-cat-file.txt
@@ -9,12 +9,13 @@ git-cat-file - Provide content or type i
SYNOPSIS
--------
-'git-cat-file' (-t | <type>) <object>
+'git-cat-file' (-t | -s | <type>) <object>
DESCRIPTION
-----------
Provides content or type of objects in the repository. The type
-is required if '-t' is not being used to find the object type.
+is required unless '-t' is used to find the object type,
+or '-s' is used to find the object size.
OPTIONS
-------
@@ -25,6 +26,10 @@ OPTIONS
Instead of the content, show the object type identified by
<object>.
+-s::
+ Instead of the content, show the object size identified by
+ <object>.
+
<type>::
Typically this matches the real type of <object> but asking
for a type that can trivially dereferenced from the given
@@ -35,7 +40,8 @@ OPTIONS
OUTPUT
------
-If '-t' is specified, one of the <type>.
+If '-t' is specified, one of the <type>. If '-s' is specified,
+the size of the <object> in bytes.
Otherwise the raw (though uncompressed) contents of the <object> will
be returned.
diff --git a/cat-file.c b/cat-file.c
--- a/cat-file.c
+++ b/cat-file.c
@@ -13,11 +13,18 @@ int main(int argc, char **argv)
unsigned long size;
if (argc != 3 || get_sha1(argv[2], sha1))
- usage("git-cat-file [-t | tagname] <sha1>");
+ usage("git-cat-file [-t | -s | tagname] <sha1>");
- if (!strcmp("-t", argv[1])) {
+ if (!strcmp("-t", argv[1]) || !strcmp("-s", argv[1])) {
if (!sha1_object_info(sha1, type, &size)) {
- printf("%s\n", type);
+ switch (argv[1][1]) {
+ case 't':
+ printf("%s\n", type);
+ break;
+ case 's':
+ printf("%lu\n", size);
+ break;
+ }
return 0;
}
buf = NULL;
------------
* Re: kernel.org and GIT tree rebuilding
2005-06-26 16:41 ` Linus Torvalds
2005-06-26 18:39 ` Junio C Hamano
2005-06-26 20:52 ` kernel.org and GIT tree rebuilding Chris Mason
@ 2005-06-28 18:06 ` Nicolas Pitre
2005-06-28 19:28 ` Linus Torvalds
2 siblings, 1 reply; 39+ messages in thread
From: Nicolas Pitre @ 2005-06-28 18:06 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List, Chris Mason
On Sun, 26 Jun 2005, Linus Torvalds wrote:
> On Fri, 24 Jun 2005, Linus Torvalds wrote:
> >
> > yeah, it clearly needs some refining to be useful, but I think you can
> > kind of see how it would work.
>
> Ok, here's how it works.
>
[...]
>
> Nico? Chris? Interested in giving it a look? It's kind of a combination of
> your things, generalized and then made to have fast lookup with the index.
Wow!
Look away for a few days and when you look back your own stuff has
been yanked out!
But actually I like this packing thing. Certainly much nicer to work
with and also more useful. And, since the pack file has its own
checksum, it is possible to use that pack format (without the index,
which can be recreated from the pack file alone anyway) for network
transfer.
Here's one improvement to the pack format, breaking it early so it
won't affect anyone at this point: a compressed object header.
Instead of the fixed 5-byte header, this patch converts it to a
variable size: since most objects are small, we save on storing the
significant size bytes that would be zero, and pack the positions of
the non-zero bytes together with the object type.
This can save up to 2% or 21KB on the test I
performed with the git repository.
diff --git a/pack-objects.c b/pack-objects.c
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -7,7 +7,7 @@
static const char pack_usage[] = "git-pack-objects [--window=N] [--depth=N] base-name < object-list";
enum object_type {
- OBJ_NONE,
+ OBJ_NONE = 0,
OBJ_COMMIT,
OBJ_TREE,
OBJ_BLOB,
@@ -55,7 +55,7 @@ static unsigned long write_object(struct
char type[10];
void *buf = read_sha1_file(entry->sha1, type, &size);
char header[25];
- unsigned hdrlen, datalen;
+ unsigned hdrlen, datalen, bit;
if (!buf)
die("unable to read %s", sha1_to_hex(entry->sha1));
@@ -63,21 +63,33 @@ static unsigned long write_object(struct
die("object %s size inconsistency (%lu vs %lu)", sha1_to_hex(entry->sha1), size, entry->size);
/*
- * The object header is a byte of 'type' followed by four bytes of
- * length, except for deltas that has the 20 bytes of delta sha
- * instead.
+ * The object header is one byte which low 4 bits represent the
+ * object type, the 4 upper bits indicates which of the following
+ * bytes are used to build the object size. For delta objects the
+ * sha1 of the reference object is also appended.
*/
- header[0] = ".CTB"[entry->type];
- hdrlen = 5;
+ header[0] = entry->type;
if (entry->delta) {
- header[0] = 'D';
- memcpy(header+5, entry->delta, 20);
+ header[0] = OBJ_DELTA;
buf = delta_against(buf, size, entry);
size = entry->delta_size;
- hdrlen = 25;
}
- datalen = htonl(size);
- memcpy(header+1, &datalen, 4);
+ hdrlen = 1;
+ bit = (1 << 4);
+ datalen = size;
+ do {
+ if (datalen & 0xff) {
+ header[0] |= bit;
+ header[hdrlen++] = datalen;
+ }
+ bit <<= 1;
+ datalen >>= 8;
+ } while (datalen);
+ if (entry->delta) {
+ memcpy(header+hdrlen, entry->delta, 20);
+ hdrlen += 20;
+ }
+
sha1write(f, header, hdrlen);
datalen = sha1write_compressed(f, buf, size);
free(buf);
diff --git a/unpack-objects.c b/unpack-objects.c
--- a/unpack-objects.c
+++ b/unpack-objects.c
@@ -12,6 +12,16 @@ struct pack_entry {
unsigned char sha1[20];
};
+enum object_type {
+ OBJ_NONE = 0,
+ OBJ_COMMIT,
+ OBJ_TREE,
+ OBJ_BLOB,
+ OBJ_DELTA
+};
+
+static char *type_s[] = { NULL, "commit", "tree", "blob", "delta" };
+
static void *pack_base;
static unsigned long pack_size;
static void *index_base;
@@ -92,7 +102,7 @@ static int check_index(void)
}
static int unpack_non_delta_entry(struct pack_entry *entry,
- int kind,
+ char *type,
unsigned char *data,
unsigned long size,
unsigned long left)
@@ -101,9 +111,8 @@ static int unpack_non_delta_entry(struct
z_stream stream;
char *buffer;
unsigned char sha1[20];
- char *type_s;
- printf("%s %c %lu\n", sha1_to_hex(entry->sha1), kind, size);
+ printf("%s %s %lu\n", sha1_to_hex(entry->sha1), type, size);
if (dry_run)
return 0;
@@ -120,18 +129,12 @@ static int unpack_non_delta_entry(struct
inflateEnd(&stream);
if ((st != Z_STREAM_END) || stream.total_out != size)
goto err_finish;
- switch (kind) {
- case 'C': type_s = "commit"; break;
- case 'T': type_s = "tree"; break;
- case 'B': type_s = "blob"; break;
- default: goto err_finish;
- }
- if (write_sha1_file(buffer, size, type_s, sha1) < 0)
+ if (write_sha1_file(buffer, size, type, sha1) < 0)
die("failed to write %s (%s)",
- sha1_to_hex(entry->sha1), type_s);
- printf("%s %s\n", sha1_to_hex(sha1), type_s);
+ sha1_to_hex(entry->sha1), type);
+ printf("%s %s\n", sha1_to_hex(sha1), type);
if (memcmp(sha1, entry->sha1, 20))
- die("resulting %s have wrong SHA1", type_s);
+ die("resulting %s have wrong SHA1", type);
finish:
st = 0;
@@ -183,15 +186,13 @@ static int unpack_delta_entry(struct pac
die("truncated pack file");
data = base_sha1 + 20;
data_size = left - 20;
- printf("%s D %lu", sha1_to_hex(entry->sha1), delta_size);
+ printf("%s delta %lu", sha1_to_hex(entry->sha1), delta_size);
printf(" %s\n", sha1_to_hex(base_sha1));
if (dry_run)
return 0;
- /* pack+5 is the base sha1, unless we have it, we need to
- * unpack it first.
- */
+ /* unless we have the base sha1, we need to unpack it first. */
if (!has_sha1_file(base_sha1)) {
struct pack_entry *base;
if (!find_pack_entry(base_sha1, &base))
@@ -236,7 +237,9 @@ static int unpack_delta_entry(struct pac
static void unpack_entry(struct pack_entry *entry)
{
unsigned long offset, size, left;
- unsigned char *pack;
+ unsigned char *pack, sizebits;
+ enum object_type type;
+ int i;
/* Have we done this one already due to deltas based on it? */
if (lookup_object(entry->sha1))
@@ -246,14 +249,25 @@ static void unpack_entry(struct pack_ent
if (offset > pack_size - 5)
die("object offset outside of pack file");
pack = pack_base + offset;
- size = (pack[1] << 24) + (pack[2] << 16) + (pack[3] << 8) + pack[4];
- left = pack_size - offset - 5;
- switch (*pack) {
- case 'C': case 'T': case 'B':
- unpack_non_delta_entry(entry, *pack, pack+5, size, left);
+ sizebits = *pack++;
+ type = sizebits & 0x0f;
+ sizebits >>= 4;
+ i = size = 0;
+ while (sizebits) {
+ if (sizebits & 1)
+ size |= *pack++ << i;
+ i += 8;
+ sizebits >>= 1;
+ }
+ left = pack_size - ((void *)pack - pack_base);
+ switch (type) {
+ case OBJ_COMMIT:
+ case OBJ_TREE:
+ case OBJ_BLOB:
+ unpack_non_delta_entry(entry, type_s[type], pack, size, left);
break;
- case 'D':
- unpack_delta_entry(entry, pack+5, size, left);
+ case OBJ_DELTA:
+ unpack_delta_entry(entry, pack, size, left);
break;
default:
die("corrupted pack file");
* Re: kernel.org and GIT tree rebuilding
2005-06-28 18:06 ` Nicolas Pitre
@ 2005-06-28 19:28 ` Linus Torvalds
2005-06-28 21:08 ` Nicolas Pitre
0 siblings, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2005-06-28 19:28 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Git Mailing List, Chris Mason
On Tue, 28 Jun 2005, Nicolas Pitre wrote:
>
> Here's one improvement to the pack format, breaking it early so it
> won't affect anyone at this point: a compressed object header.
> Instead of the fixed 5-byte header, this patch converts it to a
> variable size: since most objects are small, we save on storing the
> significant size bytes that would be zero, and pack the positions of
> the non-zero bytes together with the object type.
Ok, this is against an older version and doesn't have that "read_sha1"
thing, but yes, something like this would work.
I'd prefer the encoding to be a bit different, though: make the size be
encoded in seven bits per byte, with the high bit meaning "more to come".
We can use four bits from the "type" byte for the initial value,
making lengths 0-15 free.
	unsigned long size;
	unsigned char c;

	c = *pack++;
	size = c & 15;
	type = (c >> 4) & 7;
	while (c & 0x80) {
		c = *pack++;
		size = (size << 7) + (c & 0x7f);
	}
or something. That's even denser.
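The matching encoder would be along these lines (a sketch under the
same assumed layout: continuation flag in the top bit of every byte,
3-bit type, most significant size bits emitted first):

	static int encode_header(unsigned char *hdr, int type, unsigned long size)
	{
		unsigned char tmp[16];
		int groups = 0, n = 0;

		/* peel off 7-bit groups until the rest fits in 4 bits */
		while (size >= 16) {
			tmp[groups++] = size & 0x7f;
			size >>= 7;
		}
		hdr[n++] = (groups ? 0x80 : 0) | (type << 4) | size;
		while (groups--)	/* most significant group first */
			hdr[n++] = (groups ? 0x80 : 0) | tmp[groups];
		return n;	/* header length in bytes */
	}

With that, a 2047-byte object takes a 2-byte header: 0x80|(type<<4)|0xf
followed by 0x7f.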
However, I also end up wanting to add a "global header" to the pack-file,
that contains at least the number of objects packed. We may not know how
big the pack-file will be, but we'll at least know how many objects it
has before we start writing it.
Linus
* Re: kernel.org and GIT tree rebuilding
2005-06-28 19:28 ` Linus Torvalds
@ 2005-06-28 21:08 ` Nicolas Pitre
2005-06-28 21:27 ` Linus Torvalds
0 siblings, 1 reply; 39+ messages in thread
From: Nicolas Pitre @ 2005-06-28 21:08 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List, Chris Mason
On Tue, 28 Jun 2005, Linus Torvalds wrote:
>
>
> On Tue, 28 Jun 2005, Nicolas Pitre wrote:
> >
> > Here's one improvement to the pack format, breaking it early so it
> > won't affect anyone at this point: a compressed object header.
> > Instead of the fixed 5-byte header, this patch converts it to a
> > variable size: since most objects are small, we save on storing the
> > significant size bytes that would be zero, and pack the positions of
> > the non-zero bytes together with the object type.
>
> Ok, this is against an older version and doesn't have that "read_sha1"
> thing, but yes, something like this would work.
>
> I'd prefer the encoding to be a bit different, though: make the size be
> encoded in seven bits per byte, with the high bit meaning "more to come".
> We can use four bits from the "type" byte for the initial value,
> making lengths 0-15 free.
>
> 	unsigned long size;
> 	unsigned char c;
>
> 	c = *pack++;
> 	size = c & 15;
> 	type = (c >> 4) & 7;
> 	while (c & 0x80) {
> 		c = *pack++;
> 		size = (size << 7) + (c & 0x7f);
> 	}
>
> or something. That's even denser.
OK. New patch below.
> However, I also end up wanting to add a "global header" to the pack-file,
> that contains at least the number of objects packed. We may not know how
> big the pack-file will be, but we'll at least know how many objects it
> has before we start writing it.
Probably a version signature would be a good thing too.
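Something along these lines, presumably (a sketch only; the field
names are made up here):

	struct pack_header {
		unsigned int hdr_signature;	/* magic, eg "PACK" */
		unsigned int hdr_version;	/* format version */
		unsigned int hdr_entries;	/* number of objects in the pack */
	};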
=====
This patch adds a compressed object header. Instead of the fixed
5-byte header, it uses a variable size: since most objects are small,
we save on storing the size's most significant bytes, which are zero,
and even fold size bits into the object type byte. Objects up to 2047
bytes in size will have a header of only 2 bytes.
Signed-off-by: Nicolas Pitre <nico@cam.org>
diff --git a/pack-objects.c b/pack-objects.c
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -3,24 +3,17 @@
#include "object.h"
#include "delta.h"
#include "csum-file.h"
+#include "pack.h"
static const char pack_usage[] = "git-pack-objects [--window=N] [--depth=N] {--stdout | base-name} < object-list";
-/*
- * The object type is a single-character shorthand:
- * - 'C' for "Commit"
- * - 'T' for "Tree"
- * - 'B' for "Blob"
- * - 'G' for "taG"
- * - 'D' for "Delta"
- */
struct object_entry {
unsigned char sha1[20];
unsigned long size;
unsigned long offset;
unsigned int depth;
unsigned int hash;
- unsigned char type;
+ packed_obj_type type;
unsigned long delta_size;
struct object_entry *delta;
};
@@ -63,21 +56,30 @@ static unsigned long write_object(struct
die("object %s size inconsistency (%lu vs %lu)", sha1_to_hex(entry->sha1), size, entry->size);
/*
- * The object header is a byte of 'type' followed by four bytes of
- * length, except for deltas that has the 20 bytes of delta sha
- * instead.
+ * The object header first byte has its low 4 bits representing the
+ * object type, the 4 upper bits indicating which of the following
+ * bytes are used to build the object size. For delta objects the
+ * sha1 of the reference object is also appended.
*/
header[0] = entry->type;
- hdrlen = 5;
if (entry->delta) {
- header[0] = 'D';
- memcpy(header+5, entry->delta, 20);
+ header[0] = PACKED_DELTA;
buf = delta_against(buf, size, entry);
size = entry->delta_size;
- hdrlen = 25;
}
- datalen = htonl(size);
- memcpy(header+1, &datalen, 4);
+ header[0] |= size << 3;
+ hdrlen = 1;
+ datalen = size >> 4;
+ while (datalen) {
+ header[hdrlen - 1] |= 0x80;
+ header[hdrlen++] = datalen;
+ datalen >>= 7;
+ }
+ if (entry->delta) {
+ memcpy(header+hdrlen, entry->delta, 20);
+ hdrlen += 20;
+ }
+
sha1write(f, header, hdrlen);
datalen = sha1write_compressed(f, buf, size);
free(buf);
@@ -168,13 +170,13 @@ static void check_object(struct object_e
if (!sha1_object_info(entry->sha1, type, &entry->size)) {
if (!strcmp(type, "commit")) {
- entry->type = 'C';
+ entry->type = PACKED_COMMIT;
} else if (!strcmp(type, "tree")) {
- entry->type = 'T';
+ entry->type = PACKED_TREE;
} else if (!strcmp(type, "blob")) {
- entry->type = 'B';
+ entry->type = PACKED_BLOB;
} else if (!strcmp(type, "tag")) {
- entry->type = 'G';
+ entry->type = PACKED_TAG;
} else
die("unable to pack object %s of type %s",
sha1_to_hex(entry->sha1), type);
diff --git a/pack.h b/pack.h
new file mode 100644
--- /dev/null
+++ b/pack.h
@@ -0,0 +1,19 @@
+#ifndef PACK_H
+#define PACK_H
+
+/*
+ * The packed object type is stored in the low 3 bits of a byte.
+ * The type value 0 is a reserved prefix if ever there is more than 7
+ * object types, or any future format extensions.
+ */
+
+typedef enum {
+ PACKED_RESERVED = 0,
+ PACKED_COMMIT = 1,
+ PACKED_TREE = 2,
+ PACKED_BLOB = 3,
+ PACKED_TAG = 4,
+ PACKED_DELTA = 7
+} packed_obj_type;
+
+#endif
diff --git a/unpack-objects.c b/unpack-objects.c
--- a/unpack-objects.c
+++ b/unpack-objects.c
@@ -1,6 +1,7 @@
#include "cache.h"
#include "object.h"
#include "delta.h"
+#include "pack.h"
static int dry_run;
static int nr_entries;
@@ -12,6 +13,14 @@ struct pack_entry {
unsigned char sha1[20];
};
+static char *type_s[] = {
+ [PACKED_COMMIT] = "commit",
+ [PACKED_TREE] = "tree",
+ [PACKED_BLOB] = "blob",
+ [PACKED_TAG] = "tag",
+ [PACKED_DELTA] = "delta"
+};
+
static void *pack_base;
static unsigned long pack_size;
static void *index_base;
@@ -92,7 +101,7 @@ static int check_index(void)
}
static int unpack_non_delta_entry(struct pack_entry *entry,
- int kind,
+ char *type,
unsigned char *data,
unsigned long size,
unsigned long left)
@@ -101,9 +110,8 @@ static int unpack_non_delta_entry(struct
z_stream stream;
char *buffer;
unsigned char sha1[20];
- char *type_s;
- printf("%s %c %lu\n", sha1_to_hex(entry->sha1), kind, size);
+ printf("%s %s %lu\n", sha1_to_hex(entry->sha1), type, size);
if (dry_run)
return 0;
@@ -120,22 +128,15 @@ static int unpack_non_delta_entry(struct
inflateEnd(&stream);
if ((st != Z_STREAM_END) || stream.total_out != size)
goto err_finish;
- switch (kind) {
- case 'C': type_s = "commit"; break;
- case 'T': type_s = "tree"; break;
- case 'B': type_s = "blob"; break;
- case 'G': type_s = "tag"; break;
- default: goto err_finish;
- }
- if (write_sha1_file(buffer, size, type_s, sha1) < 0)
+ if (write_sha1_file(buffer, size, type, sha1) < 0)
die("failed to write %s (%s)",
- sha1_to_hex(entry->sha1), type_s);
- printf("%s %s\n", sha1_to_hex(sha1), type_s);
+ sha1_to_hex(entry->sha1), type);
+ printf("%s %s\n", sha1_to_hex(sha1), type);
if (memcmp(sha1, entry->sha1, 20))
- die("resulting %s have wrong SHA1", type_s);
+ die("resulting %s have wrong SHA1", type);
- finish:
st = 0;
+ finish:
free(buffer);
return st;
err_finish:
@@ -184,15 +185,13 @@ static int unpack_delta_entry(struct pac
die("truncated pack file");
data = base_sha1 + 20;
data_size = left - 20;
- printf("%s D %lu", sha1_to_hex(entry->sha1), delta_size);
+ printf("%s delta %lu", sha1_to_hex(entry->sha1), delta_size);
printf(" %s\n", sha1_to_hex(base_sha1));
if (dry_run)
return 0;
- /* pack+5 is the base sha1, unless we have it, we need to
- * unpack it first.
- */
+ /* unless we have the base sha1, we need to unpack it first. */
if (!has_sha1_file(base_sha1)) {
struct pack_entry *base;
if (!find_pack_entry(base_sha1, &base))
@@ -237,7 +236,9 @@ static int unpack_delta_entry(struct pac
static void unpack_entry(struct pack_entry *entry)
{
unsigned long offset, size, left;
- unsigned char *pack;
+ unsigned char *pack, sizebits;
+ packed_obj_type type;
+ int i;
/* Have we done this one already due to deltas based on it? */
if (lookup_object(entry->sha1))
@@ -247,17 +248,28 @@ static void unpack_entry(struct pack_ent
if (offset > pack_size - 5)
die("object offset outside of pack file");
pack = pack_base + offset;
- size = (pack[1] << 24) + (pack[2] << 16) + (pack[3] << 8) + pack[4];
- left = pack_size - offset - 5;
- switch (*pack) {
- case 'C': case 'T': case 'B': case 'G':
- unpack_non_delta_entry(entry, *pack, pack+5, size, left);
+ sizebits = *pack++;
+ type = sizebits & 0x07;
+ size = (sizebits & ~0x80) >> 3;
+ i = 4;
+ while (sizebits & 0x80) {
+ sizebits = *pack++;
+ size |= (sizebits & ~0x80) << i;
+ i += 7;
+ }
+ left = pack_size - ((void *)pack - pack_base);
+ switch (type) {
+ case PACKED_COMMIT:
+ case PACKED_TREE:
+ case PACKED_BLOB:
+ case PACKED_TAG:
+ unpack_non_delta_entry(entry, type_s[type], pack, size, left);
break;
- case 'D':
- unpack_delta_entry(entry, pack+5, size, left);
+ case PACKED_DELTA:
+ unpack_delta_entry(entry, pack, size, left);
break;
default:
- die("corrupted pack file");
+ die("corrupted pack file(unknown object type %d)", type);
}
}
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: kernel.org and GIT tree rebuilding
2005-06-28 21:08 ` Nicolas Pitre
@ 2005-06-28 21:27 ` Linus Torvalds
2005-06-28 21:55 ` [PATCH] Bugfix: initialize pack_base to NULL Junio C Hamano
2005-06-29 3:55 ` kernel.org and GIT tree rebuilding Nicolas Pitre
0 siblings, 2 replies; 39+ messages in thread
From: Linus Torvalds @ 2005-06-28 21:27 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Git Mailing List, Chris Mason
On Tue, 28 Jun 2005, Nicolas Pitre wrote:
>
> OK. New patch below.
Dammit, I wasted all that time doing it myself.
I just committed and pushed out my version. But mine also does sha1_file.c
right, so that you can use a packed archive in .git/objects/pack. Yours
has some other cleanups, so..
Can you double-check my version (it hasn't mirrored out yet, it seems, but
it should be there soon).
> Probably a version signature would be a good thing too.
I did that too, same format as for the index file. Nothing checks it
though.
Linus
^ permalink raw reply [flat|nested] 39+ messages in thread
* [PATCH] Bugfix: initialize pack_base to NULL.
2005-06-28 21:27 ` Linus Torvalds
@ 2005-06-28 21:55 ` Junio C Hamano
2005-06-29 3:55 ` kernel.org and GIT tree rebuilding Nicolas Pitre
1 sibling, 0 replies; 39+ messages in thread
From: Junio C Hamano @ 2005-06-28 21:55 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git
This was causing random segfaults, because use_packed_git() got
confused by random garbage there.
Signed-off-by: Junio C Hamano <junkio@cox.net>
---
sha1_file.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)
facb119577a28bbb3f2ac1e5f8db37fd2f6d31d8
diff --git a/sha1_file.c b/sha1_file.c
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -395,6 +395,7 @@ static struct packed_git *add_packed_git
p->pack_size = st.st_size;
p->index_base = idx_map;
p->next = NULL;
+ p->pack_base = NULL;
p->pack_last_used = 0;
return p;
}
------------
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: kernel.org and GIT tree rebuilding
2005-06-28 21:27 ` Linus Torvalds
2005-06-28 21:55 ` [PATCH] Bugfix: initialize pack_base to NULL Junio C Hamano
@ 2005-06-29 3:55 ` Nicolas Pitre
2005-06-29 5:16 ` Nicolas Pitre
1 sibling, 1 reply; 39+ messages in thread
From: Nicolas Pitre @ 2005-06-29 3:55 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List
On Tue, 28 Jun 2005, Linus Torvalds wrote:
>
>
> On Tue, 28 Jun 2005, Nicolas Pitre wrote:
> >
> > OK. New patch below.
>
> Dammit, I wasted all that time doing it myself.
>
> I just committed and pushed out my version. But mine also does sha1_file.c
> right, so that you can use a packed archive in .git/objects/pack. Yours
> has some other cleanups, so..
>
> Can you double-check my version (it hasn't mirrored out yet, it seems, but
> it should be there soon).
OK... See below the cleanups I merged from my version on top of yours:
pack-objects.c | 70 ++++++++++++++-----------------------------------------
pack.h | 17 ++++++++-----
unpack-objects.c | 66 +++++++++++++++++++++++++--------------------------
3 files changed, 63 insertions(+), 90 deletions(-)
I also restored my original object header size ordering (little endian)
for two reasons:
- it is much simpler to generate and therefore allows quite a bit
of code to be removed
- it allows for stable bit positions, which makes it much easier to
look at a hex dump of the binary data for manual debugging
Also a few code optimizations and one error return fix.
Signed-off-by: Nicolas Pitre <nico@cam.org>
diff --git a/pack-objects.c b/pack-objects.c
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -34,7 +34,7 @@ static void *delta_against(void *buf, un
if (!otherbuf)
die("unable to read %s", sha1_to_hex(entry->delta->sha1));
delta_buf = diff_delta(otherbuf, othersize,
- buf, size, &delta_size, ~0UL);
+ buf, size, &delta_size, 0UL);
if (!delta_buf || delta_size != entry->delta_size)
die("delta size changed");
free(buf);
@@ -42,54 +42,13 @@ static void *delta_against(void *buf, un
return delta_buf;
}
-/*
- * The per-object header is a pretty dense thing, which is
- * - first byte: low four bits are "size", then three bits of "type",
- * and the high bit is "size continues".
- * - each byte afterwards: low seven bits are size continuation,
- * with the high bit being "size continues"
- */
-static int encode_header(enum object_type type, unsigned long size, unsigned char *hdr)
-{
- int n = 1, i;
- unsigned char c;
-
- if (type < OBJ_COMMIT || type > OBJ_DELTA)
- die("bad type %d", type);
-
- /*
- * Shift the size up by 7 bits at a time,
- * until you get bits in the "high four".
- * That will be our beginning. We'll have
- * four size bits in 28..31, then groups
- * of seven in 21..27, 14..20, 7..13 and
- * finally 0..6.
- */
- if (size) {
- n = 5;
- while (!(size & 0xfe000000)) {
- size <<= 7;
- n--;
- }
- }
- c = (type << 4) | (size >> 28);
- for (i = 1; i < n; i++) {
- *hdr++ = c | 0x80;
- c = (size >> 21) & 0x7f;
- size <<= 7;
- }
- *hdr = c;
- return n;
-}
-
static unsigned long write_object(struct sha1file *f, struct object_entry *entry)
{
unsigned long size;
char type[10];
void *buf = read_sha1_file(entry->sha1, type, &size);
- unsigned char header[10];
+ char header[25];
unsigned hdrlen, datalen;
- enum object_type obj_type;
if (!buf)
die("unable to read %s", sha1_to_hex(entry->sha1));
@@ -97,22 +56,31 @@ static unsigned long write_object(struct
die("object %s size inconsistency (%lu vs %lu)", sha1_to_hex(entry->sha1), size, entry->size);
/*
- * The object header is a byte of 'type' followed by zero or
- * more bytes of length. For deltas, the 20 bytes of delta sha1
- * follows that.
+ * The object header first byte has its low 3 bits representing the
+ * object type, the 4 upper bits indicating which of the following
+ * bytes are used to build the object size. For delta objects the
+ * sha1 of the reference object is also appended.
*/
- obj_type = entry->type;
if (entry->delta) {
+ header[0] = OBJ_DELTA;
buf = delta_against(buf, size, entry);
size = entry->delta_size;
- obj_type = OBJ_DELTA;
+ } else
+ header[0] = entry->type;
+ header[0] |= size << 3;
+ hdrlen = 1;
+ datalen = size >> 4;
+ while (datalen) {
+ header[hdrlen - 1] |= 0x80;
+ header[hdrlen++] = datalen;
+ datalen >>= 7;
}
- hdrlen = encode_header(obj_type, size, header);
- sha1write(f, header, hdrlen);
if (entry->delta) {
- sha1write(f, entry->delta, 20);
+ memcpy(header+hdrlen, entry->delta, 20);
hdrlen += 20;
}
+
+ sha1write(f, header, hdrlen);
datalen = sha1write_compressed(f, buf, size);
free(buf);
return hdrlen + datalen;
diff --git a/pack.h b/pack.h
--- a/pack.h
+++ b/pack.h
@@ -1,13 +1,18 @@
#ifndef PACK_H
#define PACK_H
+/*
+ * The packed object type is stored in the low 3 bits of a byte.
+ * The type value 0 is reserved as a prefix in case there are ever
+ * more than 7 object types, or for any future format extensions.
+ */
enum object_type {
- OBJ_NONE,
- OBJ_COMMIT,
- OBJ_TREE,
- OBJ_BLOB,
- OBJ_TAG,
- OBJ_DELTA,
+ OBJ_EXT = 0,
+ OBJ_COMMIT = 1,
+ OBJ_TREE = 2,
+ OBJ_BLOB = 3,
+ OBJ_TAG = 4,
+ OBJ_DELTA = 7
};
/*
diff --git a/unpack-objects.c b/unpack-objects.c
--- a/unpack-objects.c
+++ b/unpack-objects.c
@@ -13,6 +13,14 @@ struct pack_entry {
unsigned char sha1[20];
};
+static char *type_string[] = {
+ [OBJ_COMMIT] = "commit",
+ [OBJ_TREE] = "tree",
+ [OBJ_BLOB] = "blob",
+ [OBJ_TAG] = "tag",
+ [OBJ_DELTA] = "delta"
+};
+
static void *pack_base;
static unsigned long pack_size;
static void *index_base;
@@ -93,7 +101,7 @@ static int check_index(void)
}
static int unpack_non_delta_entry(struct pack_entry *entry,
- enum object_type kind,
+ char *type,
unsigned char *data,
unsigned long size,
unsigned long left)
@@ -102,9 +110,8 @@ static int unpack_non_delta_entry(struct
z_stream stream;
char *buffer;
unsigned char sha1[20];
- char *type;
- printf("%s %c %lu\n", sha1_to_hex(entry->sha1), ".CTBGD"[kind], size);
+ printf("%s %s %lu\n", sha1_to_hex(entry->sha1), type, size);
if (dry_run)
return 0;
@@ -121,13 +128,6 @@ static int unpack_non_delta_entry(struct
inflateEnd(&stream);
if ((st != Z_STREAM_END) || stream.total_out != size)
goto err_finish;
- switch (kind) {
- case OBJ_COMMIT: type = "commit"; break;
- case OBJ_TREE: type = "tree"; break;
- case OBJ_BLOB: type = "blob"; break;
- case OBJ_TAG: type = "tag"; break;
- default: goto err_finish;
- }
if (write_sha1_file(buffer, size, type, sha1) < 0)
die("failed to write %s (%s)",
sha1_to_hex(entry->sha1), type);
@@ -135,8 +135,8 @@ static int unpack_non_delta_entry(struct
if (memcmp(sha1, entry->sha1, 20))
die("resulting %s have wrong SHA1", type);
- finish:
st = 0;
+ finish:
free(buffer);
return st;
err_finish:
@@ -185,15 +185,13 @@ static int unpack_delta_entry(struct pac
die("truncated pack file");
data = base_sha1 + 20;
data_size = left - 20;
- printf("%s D %lu", sha1_to_hex(entry->sha1), delta_size);
+ printf("%s delta %lu", sha1_to_hex(entry->sha1), delta_size);
printf(" %s\n", sha1_to_hex(base_sha1));
if (dry_run)
return 0;
- /* pack+5 is the base sha1, unless we have it, we need to
- * unpack it first.
- */
+ /* unless we have the base sha1, we need to unpack it first. */
if (!has_sha1_file(base_sha1)) {
struct pack_entry *base;
if (!find_pack_entry(base_sha1, &base))
@@ -238,8 +236,9 @@ static int unpack_delta_entry(struct pac
static void unpack_entry(struct pack_entry *entry)
{
unsigned long offset, size, left;
- unsigned char *pack, c;
- int type;
+ unsigned char c, *pack = pack_base;
+ int i;
+ enum object_type type;
/* Have we done this one already due to deltas based on it? */
if (lookup_object(entry->sha1))
@@ -247,20 +246,17 @@ static void unpack_entry(struct pack_ent
offset = ntohl(entry->offset);
if (offset >= pack_size)
- goto bad;
-
- pack = pack_base + offset;
- c = *pack++;
- offset++;
- type = (c >> 4) & 7;
- size = (c & 15);
+ goto out_of_bound;
+ c = pack[offset++];
+ type = c & 0x07;
+ size = (c & ~0x80) >> 3;
+ i = 4;
while (c & 0x80) {
if (offset >= pack_size)
- goto bad;
- offset++;
- c = *pack++;
- size = (size << 7) + (c & 0x7f);
-
+ goto out_of_bound;
+ c = pack[offset++];
+ size |= (c & ~0x80) << i;
+ i += 7;
}
left = pack_size - offset;
switch (type) {
@@ -268,14 +264,18 @@ static void unpack_entry(struct pack_ent
case OBJ_TREE:
case OBJ_BLOB:
case OBJ_TAG:
- unpack_non_delta_entry(entry, type, pack, size, left);
+ unpack_non_delta_entry(entry, type_string[type],
+ pack+offset, size, left);
return;
case OBJ_DELTA:
- unpack_delta_entry(entry, pack, size, left);
+ unpack_delta_entry(entry, pack+offset, size, left);
return;
+ default:
+ die("corrupted pack file(unknown object type %d)", type);
}
-bad:
- die("corrupted pack file");
+
+ out_of_bound:
+ die("corrupted pack file (object offset out of bound)");
}
/*
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: kernel.org and GIT tree rebuilding
2005-06-29 3:55 ` kernel.org and GIT tree rebuilding Nicolas Pitre
@ 2005-06-29 5:16 ` Nicolas Pitre
2005-06-29 5:43 ` Linus Torvalds
0 siblings, 1 reply; 39+ messages in thread
From: Nicolas Pitre @ 2005-06-29 5:16 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List
On Tue, 28 Jun 2005, Nicolas Pitre wrote:
> OK... See below the cleanups I merged from my version on top of yours:
Of course by the time I sent the above you had already rewritten the
thing to be streamable.
So again :-) see below the cleanups I merged from my version on top of
yours:
pack-objects.c | 70 ++++++++++++++-----------------------------------------
pack.h | 17 ++++++++-----
unpack-objects.c | 29 ++++++++++++----------
3 files changed, 46 insertions(+), 70 deletions(-)
I also restored my original object header size ordering (little endian)
for two reasons:
- it is much simpler to generate and therefore allows quite a bit
of code to be removed
- it allows for stable bit positions, which makes it much easier to
look at a hex dump of the binary data for manual debugging
Signed-off-by: Nicolas Pitre <nico@cam.org>
diff --git a/pack-objects.c b/pack-objects.c
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -34,7 +34,7 @@ static void *delta_against(void *buf, un
if (!otherbuf)
die("unable to read %s", sha1_to_hex(entry->delta->sha1));
delta_buf = diff_delta(otherbuf, othersize,
- buf, size, &delta_size, ~0UL);
+ buf, size, &delta_size, 0UL);
if (!delta_buf || delta_size != entry->delta_size)
die("delta size changed");
free(buf);
@@ -42,54 +42,13 @@ static void *delta_against(void *buf, un
return delta_buf;
}
-/*
- * The per-object header is a pretty dense thing, which is
- * - first byte: low four bits are "size", then three bits of "type",
- * and the high bit is "size continues".
- * - each byte afterwards: low seven bits are size continuation,
- * with the high bit being "size continues"
- */
-static int encode_header(enum object_type type, unsigned long size, unsigned char *hdr)
-{
- int n = 1, i;
- unsigned char c;
-
- if (type < OBJ_COMMIT || type > OBJ_DELTA)
- die("bad type %d", type);
-
- /*
- * Shift the size up by 7 bits at a time,
- * until you get bits in the "high four".
- * That will be our beginning. We'll have
- * four size bits in 28..31, then groups
- * of seven in 21..27, 14..20, 7..13 and
- * finally 0..6.
- */
- if (size) {
- n = 5;
- while (!(size & 0xfe000000)) {
- size <<= 7;
- n--;
- }
- }
- c = (type << 4) | (size >> 28);
- for (i = 1; i < n; i++) {
- *hdr++ = c | 0x80;
- c = (size >> 21) & 0x7f;
- size <<= 7;
- }
- *hdr = c;
- return n;
-}
-
static unsigned long write_object(struct sha1file *f, struct object_entry *entry)
{
unsigned long size;
char type[10];
void *buf = read_sha1_file(entry->sha1, type, &size);
- unsigned char header[10];
+ char header[25];
unsigned hdrlen, datalen;
- enum object_type obj_type;
if (!buf)
die("unable to read %s", sha1_to_hex(entry->sha1));
@@ -97,22 +56,31 @@ static unsigned long write_object(struct
die("object %s size inconsistency (%lu vs %lu)", sha1_to_hex(entry->sha1), size, entry->size);
/*
- * The object header is a byte of 'type' followed by zero or
- * more bytes of length. For deltas, the 20 bytes of delta sha1
- * follows that.
+ * The object header first byte has its low 3 bits representing the
+ * object type, the 4 upper bits indicating which of the following
+ * bytes are used to build the object size. For delta objects the
+ * sha1 of the reference object is also appended.
*/
- obj_type = entry->type;
if (entry->delta) {
+ header[0] = OBJ_DELTA;
buf = delta_against(buf, size, entry);
size = entry->delta_size;
- obj_type = OBJ_DELTA;
+ } else
+ header[0] = entry->type;
+ header[0] |= size << 3;
+ hdrlen = 1;
+ datalen = size >> 4;
+ while (datalen) {
+ header[hdrlen - 1] |= 0x80;
+ header[hdrlen++] = datalen;
+ datalen >>= 7;
}
- hdrlen = encode_header(obj_type, size, header);
- sha1write(f, header, hdrlen);
if (entry->delta) {
- sha1write(f, entry->delta, 20);
+ memcpy(header+hdrlen, entry->delta, 20);
hdrlen += 20;
}
+
+ sha1write(f, header, hdrlen);
datalen = sha1write_compressed(f, buf, size);
free(buf);
return hdrlen + datalen;
diff --git a/pack.h b/pack.h
--- a/pack.h
+++ b/pack.h
@@ -1,13 +1,18 @@
#ifndef PACK_H
#define PACK_H
+/*
+ * The packed object type is stored in the low 3 bits of a byte.
+ * The type value 0 is reserved as a prefix in case there are ever
+ * more than 7 object types, or for any future format extensions.
+ */
enum object_type {
- OBJ_NONE,
- OBJ_COMMIT,
- OBJ_TREE,
- OBJ_BLOB,
- OBJ_TAG,
- OBJ_DELTA,
+ OBJ_EXT = 0,
+ OBJ_COMMIT = 1,
+ OBJ_TREE = 2,
+ OBJ_BLOB = 3,
+ OBJ_TAG = 4,
+ OBJ_DELTA = 7
};
/*
diff --git a/unpack-objects.c b/unpack-objects.c
--- a/unpack-objects.c
+++ b/unpack-objects.c
@@ -6,6 +6,14 @@
static int dry_run;
static const char unpack_usage[] = "git-unpack-objects < pack-file";
+static char *type_string[] = {
+ [OBJ_COMMIT] = "commit",
+ [OBJ_TREE] = "tree",
+ [OBJ_BLOB] = "blob",
+ [OBJ_TAG] = "tag",
+ [OBJ_DELTA] = "delta"
+};
+
/* We always read in 4kB chunks. */
static unsigned char buffer[4096];
static unsigned long offset, len, eof;
@@ -139,19 +147,11 @@ static void added_object(unsigned char *
}
}
-static int unpack_non_delta_entry(enum object_type kind, unsigned long size)
+static int unpack_non_delta_entry(char *type, unsigned long size)
{
void *buf = get_data(size);
unsigned char sha1[20];
- char *type;
- switch (kind) {
- case OBJ_COMMIT: type = "commit"; break;
- case OBJ_TREE: type = "tree"; break;
- case OBJ_BLOB: type = "blob"; break;
- case OBJ_TAG: type = "tag"; break;
- default: die("bad type %d", kind);
- }
if (write_sha1_file(buf, size, type, sha1) < 0)
die("failed to write object");
added_object(sha1, type, buf, size);
@@ -184,26 +184,29 @@ static int unpack_delta_entry(unsigned l
static void unpack_one(void)
{
unsigned char *pack, c;
+ unsigned int i;
unsigned long size;
enum object_type type;
pack = fill(1);
c = *pack;
use(1);
- type = (c >> 4) & 7;
- size = (c & 15);
+ type = c & 0x07;
+ size = (c & ~0x80) >> 3;
+ i = 4;
while (c & 0x80) {
pack = fill(1);
c = *pack++;
use(1);
- size = (size << 7) + (c & 0x7f);
+ size |= (c & ~0x80) << i;
+ i += 7;
}
switch (type) {
case OBJ_COMMIT:
case OBJ_TREE:
case OBJ_BLOB:
case OBJ_TAG:
- unpack_non_delta_entry(type, size);
+ unpack_non_delta_entry(type_string[type], size);
return;
case OBJ_DELTA:
unpack_delta_entry(size);
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: kernel.org and GIT tree rebuilding
2005-06-29 5:16 ` Nicolas Pitre
@ 2005-06-29 5:43 ` Linus Torvalds
2005-06-29 5:54 ` Linus Torvalds
0 siblings, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2005-06-29 5:43 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Git Mailing List
On Wed, 29 Jun 2005, Nicolas Pitre wrote:
>
> Of course by the time I sent the above you had already rewritten the
> thing to be streamable.
And by the time you sent me a new version, I'd already taken part of your
old one by hand ;)
Linus
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: kernel.org and GIT tree rebuilding
2005-06-29 5:43 ` Linus Torvalds
@ 2005-06-29 5:54 ` Linus Torvalds
2005-06-29 7:16 ` Last mile for 1.0 again Junio C Hamano
0 siblings, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2005-06-29 5:54 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Git Mailing List
On Tue, 28 Jun 2005, Linus Torvalds wrote:
>
> On Wed, 29 Jun 2005, Nicolas Pitre wrote:
> >
> > Of course by the time I sent the above you had already rewritten the
> > thing to be streamable.
>
> And by the time you sent me a new version, I'd already taken part of your
> old one by hand ;)
Btw, I have the size/type bits reversed from your setup, but please don't
change that, since that would be yet another incompatible pack format
change, and I'd like to calm things down.
Also, I notice that you decode the sizes really strangely: you have a
"while() { }" loop and two separate loads. It's much nicer to do it with a
"do { } while()" loop and a single load, since not only is it less code,
a do-while loop compiles to better code than a while() loop (unless the
compiler is crazy, which it sometimes is).
Linus
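For concreteness, here is a sketch of the do/while form against the
streaming fill()/use() helpers from unpack-objects.c: a single load
site, with the first byte folded into the loop. It uses the low-3-bits
type layout from the patches above for illustration (the committed
format has the size/type bits swapped, as noted), and it is not the
committed code:

	static void parse_header(enum object_type *type, unsigned long *sizep)
	{
		unsigned long size = 0;
		unsigned shift = 0;
		unsigned char c;
		int first = 1;

		do {
			c = *fill(1);
			use(1);
			if (first) {
				*type = c & 0x07;	/* low 3 bits: type */
				size = (c >> 3) & 0x0f;	/* next 4 bits: size */
				shift = 4;
				first = 0;
			} else {
				size |= (unsigned long)(c & 0x7f) << shift;
				shift += 7;
			}
		} while (c & 0x80);	/* high bit: more size bytes follow */
		*sizep = size;
	}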
^ permalink raw reply [flat|nested] 39+ messages in thread
* Last mile for 1.0 again
2005-06-29 5:54 ` Linus Torvalds
@ 2005-06-29 7:16 ` Junio C Hamano
2005-06-29 9:51 ` [PATCH] Add git-verify-pack command Junio C Hamano
2005-07-04 21:40 ` Last mile for 1.0 again Daniel Barkalow
0 siblings, 2 replies; 39+ messages in thread
From: Junio C Hamano @ 2005-06-29 7:16 UTC (permalink / raw)
To: git; +Cc: Linus Torvalds
>>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes:
LT> ..., but please don't
LT> change that, since that would be yet another incompatible pack format
LT> change, and I'd like to calm things down.
Calming things down is good. Some stuff we should be looking at
during the calming-down period:
- Ask for expert help to verify that our use of zlib is OK; Linus
asked for this in a message tonight. I am not qualified to do
this myself.
- There is currently nothing that verifies the SHA1 sums
embedded in idx and pack files. A tool for it is needed
[*1*]. I may be doing this myself.
- Although there is a hook in sha1_file() that allows mapping a
pack file on demand (and keeping track of their use for LRU
eviction), there is currently no code to actually evict a
mapped pack file. We need to add some sort of use-reference
counting to lock down a pack file while it is in use, if
somebody is inclined to do this; see the sketch after this
list. I will not be doing this myself immediately.
- Think about what to do with the "dumb server" pull protocols
regarding packed archive files (think about http-pull). I
will not be doing this myself.
- Design "smart server" pull/push pair. This can independently
and actively be done without impacting "calming down" period,
and I am assuming Dan is already looking into this. This may
result in another "deathmatch" between Dan and Jason
Mcmullan, which would be fun to watch ;-).
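A rough sketch of the use-count plus LRU shape the pack-mapping
item above asks for (hypothetical structure and field names, not
actual sha1_file.c code):

	struct packed_git_sketch {
		struct packed_git_sketch *next;
		void *pack_base;		/* NULL when not mapped */
		unsigned long pack_size;
		unsigned int pack_last_used;	/* LRU tick */
		unsigned int pack_use_cnt;	/* >0 pins the mapping */
	};

	static unsigned int pack_used_ctr;

	/* Pick the least-recently-used pack that is mapped but idle. */
	static struct packed_git_sketch *lru_victim(struct packed_git_sketch *head)
	{
		struct packed_git_sketch *p, *lru = NULL;

		for (p = head; p; p = p->next) {
			if (p->pack_use_cnt || !p->pack_base)
				continue;	/* pinned or unmapped: skip */
			if (!lru || p->pack_last_used < lru->pack_last_used)
				lru = p;
		}
		return lru;		/* NULL: nothing can be evicted */
	}

	static void use_pack(struct packed_git_sketch *p)
	{
		/* map on demand here, unmapping lru_victim()s while the
		 * total mapped size is over budget (mmap not shown) */
		p->pack_last_used = ++pack_used_ctr;
		p->pack_use_cnt++;
	}

	static void unuse_pack(struct packed_git_sketch *p)
	{
		p->pack_use_cnt--;	/* may now be chosen by lru_victim() */
	}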
Not limited to "calming down the packed GIT", the bigger-picture
pre-release preparation items I think we need to look into are:
- Update "Last mile for 1.0" list. I think the only thing that
is left from the one I posted on Jun 5th is the tutorials.
Anybody interested in adding more pages to the tutorials? If
there is no taker, I may start continuing what is already
there, stealing example project outline from "Arch meets
Hello World" [*2*].
- Double-check that the documentation, the usage strings in the
code, and what the code does still match. Last time I checked, I
think some files in the Documentation/ directory were not reachable
from the main GIT(1) page and were not touched by the build
process from Documentation/Makefile. I will not be doing
this myself.
- Blame/Annotate. Does anybody have a fast and correct one
[*3*]?
- Publicity. Somebody may want to start preparing materials to
have us included in http://better-scm.berlios.de/comparison/
[Footnotes]
*1* One possibility is to add that to fsck-cache when --full is
used, but I am somewhat reluctant to do it that way. It would
require you to place that "suspicious" pack under your
repository's .git/objects/pack in order to verify it, which can
spread the damage before you know it.
In general, an idx file is relatively small, so we _could_ check
the SHA1 sum for idx files when we map them at runtime in
check_packed_git_idx(). However, verifying a 1MB idx file (kernel
archive case) every time we run a single core GIT command may
be a price we would not want to pay, so I am not too
enthusiastic about doing it. We certainly would not want to do
this for pack files every time a pack is mapped at runtime (60MB
kernel archive case).
*2* http://regexps.srparish.net/www/tutorial/html/arch.html.
I am not talking about stealing the concepts but just talking
about stealing the sample project on which example players work
in the sample sessions. I think its section on update/replay
matches closely to merge/rebase, for example.
*3* I have what I wrote in Perl which does rename/copy/rewrite
and multiple parents correctly, but it is way too slow for
on-demand use.
^ permalink raw reply [flat|nested] 39+ messages in thread
* [PATCH] Add git-verify-pack command.
2005-06-29 7:16 ` Last mile for 1.0 again Junio C Hamano
@ 2005-06-29 9:51 ` Junio C Hamano
2005-06-29 16:15 ` Linus Torvalds
2005-07-04 21:40 ` Last mile for 1.0 again Daniel Barkalow
1 sibling, 1 reply; 39+ messages in thread
From: Junio C Hamano @ 2005-06-29 9:51 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git
Given a list of <pack>.idx files, this command validates the
index file and the corresponding .pack file for consistency.
This patch also uses the same validation mechanism in fsck-cache
when the --full flag is used.
During normal operation, sha1_file.c verifies that a given .idx
file matches the .pack file by comparing the SHA1 checksum
stored in .idx file and .pack file as a minimum sanity check.
We may further want to check the pack signature and version when
we map the pack, but that would be a separate patch.
Earlier, an error mapping a pack file was not flagged as fatal but
led to a random fatal error later. This version explicitly die()s
when such an error is detected.
Signed-off-by: Junio C Hamano <junkio@cox.net>
---
*** This actually covers two of the items I listed in the "Last
*** mile again" message. I ended up doing the LRU myself.
*** Linus, what would be the practical/recommended/sane limit
*** for mmap regions we would want to use? Currently I start
*** throwing away mapped packs when the total size of mapped
*** packs exceeds 64MB.
*** Currently, the pack-objects/unpack-objects pair is not
*** documented. Any takers?
*** Oh, again, Linus, what is your preferred way to get a
*** cover-letter material (like this) for a single patch? I
*** always do a separate cover letter for multi-patch series
*** with [PATCH 0/N], but do you prefer to get [PATCH 0/1] and
*** [PATCH 1/1] for a single patch, or do you prefer to see
*** commentaries after the three-dash line like this message
*** does?
Documentation/git-verify-pack.txt | 38 +++++++++++++++++++++
Documentation/git.txt | 3 ++
Makefile | 5 ++-
cache.h | 4 ++
fsck-cache.c | 5 +++
pack.h | 2 +
sha1_file.c | 66 ++++++++++++++++++++++++++++---------
t/t5300-pack-object.sh | 38 +++++++++++++++++++++
verify-pack.c | 26 +++++++++++++++
verify_pack.c | 62 +++++++++++++++++++++++++++++++++++
10 files changed, 231 insertions(+), 18 deletions(-)
create mode 100644 Documentation/git-verify-pack.txt
create mode 100644 verify-pack.c
create mode 100644 verify_pack.c
8ac4f374a587c58b3d2ecc44220b6b866f769350
diff --git a/Documentation/git-verify-pack.txt b/Documentation/git-verify-pack.txt
new file mode 100644
--- /dev/null
+++ b/Documentation/git-verify-pack.txt
@@ -0,0 +1,38 @@
+git-verify-pack(1)
+==================
+v0.1, June 2005
+
+NAME
+----
+git-verify-pack - Validate packed GIT archive files.
+
+
+SYNOPSIS
+--------
+'git-verify-pack' <pack>.idx ...
+
+
+DESCRIPTION
+-----------
+Reads the given idx files for packed GIT archives created with
+the git-pack-objects command and verifies each idx file and the
+corresponding pack file.
+
+OPTIONS
+-------
+<pack>.idx ...::
+ The idx files to verify.
+
+
+Author
+------
+Written by Junio C Hamano <junkio@cox.net>
+
+Documentation
+--------------
+Documentation by Junio C Hamano
+
+GIT
+---
+Part of the link:git.html[git] suite
+
diff --git a/Documentation/git.txt b/Documentation/git.txt
--- a/Documentation/git.txt
+++ b/Documentation/git.txt
@@ -110,6 +110,9 @@ link:git-tar-tree.html[git-tar-tree]::
link:git-unpack-file.html[git-unpack-file]::
Creates a temporary file with a blob's contents
+link:git-verify-pack.html[git-verify-pack]::
+ Validates packed GIT archive files
+
The interrogate commands may create files - and you can force them to
touch the working file set - but in general they don't
diff --git a/Makefile b/Makefile
--- a/Makefile
+++ b/Makefile
@@ -36,7 +36,7 @@ PROG= git-update-cache git-diff-files
git-diff-helper git-tar-tree git-local-pull git-write-blob \
git-get-tar-commit-id git-apply git-stripspace \
git-cvs2git git-diff-stages git-rev-parse git-patch-id \
- git-pack-objects git-unpack-objects
+ git-pack-objects git-unpack-objects git-verify-pack
all: $(PROG)
@@ -45,7 +45,7 @@ install: $(PROG) $(SCRIPTS)
LIB_OBJS=read-cache.o sha1_file.o usage.o object.o commit.o tree.o blob.o \
tag.o date.o index.o diff-delta.o patch-delta.o entry.o \
- epoch.o refs.o csum-file.o
+ epoch.o refs.o csum-file.o verify_pack.o
LIB_FILE=libgit.a
LIB_H=cache.h object.h blob.h tree.h commit.h tag.h delta.h epoch.h csum-file.h pack.h
@@ -124,6 +124,7 @@ git-rev-parse: rev-parse.c
git-patch-id: patch-id.c
git-pack-objects: pack-objects.c
git-unpack-objects: unpack-objects.c
+git-verify-pack: verify-pack.c
git-http-pull: LIBS += -lcurl
git-rev-list: LIBS += -lssl
diff --git a/cache.h b/cache.h
--- a/cache.h
+++ b/cache.h
@@ -246,9 +246,13 @@ extern struct packed_git {
unsigned int *index_base;
void *pack_base;
unsigned int pack_last_used;
+ unsigned int pack_use_cnt;
char pack_name[0]; /* something like ".git/objects/pack/xxxxx.pack" */
} *packed_git;
extern void prepare_packed_git(void);
+extern int use_packed_git(struct packed_git *);
+extern void unuse_packed_git(struct packed_git *);
+extern struct packed_git *add_packed_git(char *, int);
extern int num_packed_objects(const struct packed_git *p);
extern int nth_packed_object_sha1(const struct packed_git *, int, unsigned char*);
diff --git a/fsck-cache.c b/fsck-cache.c
--- a/fsck-cache.c
+++ b/fsck-cache.c
@@ -6,6 +6,7 @@
#include "tree.h"
#include "blob.h"
#include "tag.h"
+#include "pack.h"
#define REACHABLE 0x0001
@@ -437,6 +438,10 @@ int main(int argc, char **argv)
alt_odb[j].name[-1] = '/';
}
prepare_packed_git();
+ for (p = packed_git; p; p = p->next)
+ /* verify gives error messages itself */
+ verify_pack(p);
+
for (p = packed_git; p; p = p->next) {
int num = num_packed_objects(p);
for (i = 0; i < num; i++) {
diff --git a/pack.h b/pack.h
--- a/pack.h
+++ b/pack.h
@@ -27,4 +27,6 @@ struct pack_header {
unsigned int hdr_entries;
};
+extern int verify_pack(struct packed_git *);
+
#endif
diff --git a/sha1_file.c b/sha1_file.c
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -302,7 +302,7 @@ static int check_packed_git_idx(const ch
index = idx_map;
/* check index map */
- if (idx_size < 4*256 + 20)
+ if (idx_size < 4*256 + 20 + 20)
return error("index file too small");
nr = 0;
for (i = 0; i < 256; i++) {
@@ -327,12 +327,29 @@ static int check_packed_git_idx(const ch
return 0;
}
-static void unuse_one_packed_git(void)
+static int unuse_one_packed_git(void)
{
- /* NOTYET */
+ struct packed_git *p, *lru = NULL;
+
+ for (p = packed_git; p; p = p->next) {
+ if (p->pack_use_cnt || !p->pack_base)
+ continue;
+ if (!lru || p->pack_last_used < lru->pack_last_used)
+ lru = p;
+ }
+ if (!lru)
+ return 0;
+ munmap(lru->pack_base, lru->pack_size);
+ lru->pack_base = NULL;
+ return 1;
+}
+
+void unuse_packed_git(struct packed_git *p)
+{
+ p->pack_use_cnt--;
}
-static int use_packed_git(struct packed_git *p)
+int use_packed_git(struct packed_git *p)
{
if (!p->pack_base) {
int fd;
@@ -340,28 +357,36 @@ static int use_packed_git(struct packed_
void *map;
pack_mapped += p->pack_size;
- while (PACK_MAX_SZ < pack_mapped)
- unuse_one_packed_git();
+ while (PACK_MAX_SZ < pack_mapped && unuse_one_packed_git())
+ ; /* nothing */
fd = open(p->pack_name, O_RDONLY);
if (fd < 0)
- return -1;
+ die("packfile %s cannot be opened", p->pack_name);
if (fstat(fd, &st)) {
close(fd);
- return -1;
+ die("packfile %s cannot be opened", p->pack_name);
}
if (st.st_size != p->pack_size)
- return -1;
+ die("packfile %s size mismatch.", p->pack_name);
map = mmap(NULL, p->pack_size, PROT_READ, MAP_PRIVATE, fd, 0);
close(fd);
if (map == MAP_FAILED)
- return -1;
+ die("packfile %s cannot be mapped.", p->pack_name);
p->pack_base = map;
+
+ /* Check if the pack file matches the index file.
+ * This is cheap.
+ */
+ if (memcmp((char*)(p->index_base) + p->index_size - 40,
+ p->pack_base + p->pack_size - 20, 20))
+ die("packfile %s does not match index.", p->pack_name);
}
p->pack_last_used = pack_used_ctr++;
+ p->pack_use_cnt++;
return 0;
}
-static struct packed_git *add_packed_git(char *path, int path_len)
+struct packed_git *add_packed_git(char *path, int path_len)
{
struct stat st;
struct packed_git *p;
@@ -388,6 +413,7 @@ static struct packed_git *add_packed_git
p->next = NULL;
p->pack_base = NULL;
p->pack_last_used = 0;
+ p->pack_use_cnt = 0;
return p;
}
@@ -671,6 +697,7 @@ static int packed_object_info(struct pac
unsigned long offset, size, left;
unsigned char *pack;
enum object_type kind;
+ int retval;
if (use_packed_git(p))
die("cannot map packed file");
@@ -681,8 +708,9 @@ static int packed_object_info(struct pac
switch (kind) {
case OBJ_DELTA:
- return packed_delta_info(pack, size, left, type, sizep);
- break;
+ retval = packed_delta_info(pack, size, left, type, sizep);
+ unuse_packed_git(p);
+ return retval;
case OBJ_COMMIT:
strcpy(type, "commit");
break;
@@ -699,6 +727,7 @@ static int packed_object_info(struct pac
die("corrupted pack file");
}
*sizep = size;
+ unuse_packed_git(p);
return 0;
}
@@ -785,6 +814,7 @@ static void *unpack_entry(struct pack_en
unsigned long offset, size, left;
unsigned char *pack;
enum object_type kind;
+ void *retval;
if (use_packed_git(p))
die("cannot map packed file");
@@ -794,7 +824,9 @@ static void *unpack_entry(struct pack_en
left = p->pack_size - offset;
switch (kind) {
case OBJ_DELTA:
- return unpack_delta_entry(pack, size, left, type, sizep);
+ retval = unpack_delta_entry(pack, size, left, type, sizep);
+ unuse_packed_git(p);
+ return retval;
case OBJ_COMMIT:
strcpy(type, "commit");
break;
@@ -811,12 +843,14 @@ static void *unpack_entry(struct pack_en
die("corrupted pack file");
}
*sizep = size;
- return unpack_non_delta_entry(pack, size, left);
+ retval = unpack_non_delta_entry(pack, size, left);
+ unuse_packed_git(p);
+ return retval;
}
int num_packed_objects(const struct packed_git *p)
{
- /* See check_packed_git_idx and pack-objects.c */
+ /* See check_packed_git_idx() */
return (p->index_size - 20 - 20 - 4*256) / 24;
}
diff --git a/t/t5300-pack-object.sh b/t/t5300-pack-object.sh
--- a/t/t5300-pack-object.sh
+++ b/t/t5300-pack-object.sh
@@ -127,4 +127,42 @@ test_expect_success \
} >current &&
diff expect current'
+unset GIT_OBJECT_DIRECTORY
+
+test_expect_success \
+ 'verify pack' \
+ 'git-verify-pack test-1.idx test-2.idx'
+
+test_expect_success \
+ 'corrupt a pack and see if verify catches' \
+ 'cp test-1.idx test-3.idx &&
+ cp test-2.pack test-3.pack &&
+ if git-verify-pack test-3.idx
+ then false
+ else :;
+ fi &&
+
+ cp test-1.pack test-3.pack &&
+ dd if=/dev/zero of=test-3.pack count=1 bs=1 conv=notrunc seek=2 &&
+ if git-verify-pack test-3.idx
+ then false
+ else :;
+ fi &&
+
+ cp test-1.pack test-3.pack &&
+ dd if=/dev/zero of=test-3.pack count=1 bs=1 conv=notrunc seek=7 &&
+ if git-verify-pack test-3.idx
+ then false
+ else :;
+ fi &&
+
+ cp test-1.pack test-3.pack &&
+ dd if=/dev/zero of=test-3.pack count=1 bs=1 conv=notrunc seek=12 &&
+ if git-verify-pack test-3.idx
+ then false
+ else :;
+ fi &&
+
+ :'
+
test_done
diff --git a/verify-pack.c b/verify-pack.c
new file mode 100644
--- /dev/null
+++ b/verify-pack.c
@@ -0,0 +1,26 @@
+#include "cache.h"
+#include "pack.h"
+
+static int verify_one_pack(char *arg)
+{
+ struct packed_git *g = add_packed_git(arg, strlen(arg));
+ if (!g)
+ return -1;
+ return verify_pack(g);
+}
+
+int main(int ac, char **av)
+{
+ int errs = 0;
+
+ while (1 < ac) {
+ char path[PATH_MAX];
+ strcpy(path, av[1]);
+ if (verify_one_pack(path))
+ errs++;
+ else
+ printf("%s: OK\n", av[1]);
+ ac--; av++;
+ }
+ return !!errs;
+}
diff --git a/verify_pack.c b/verify_pack.c
new file mode 100644
--- /dev/null
+++ b/verify_pack.c
@@ -0,0 +1,62 @@
+#include "cache.h"
+#include "pack.h"
+
+static int verify_packfile(struct packed_git *p)
+{
+ unsigned long index_size = p->index_size;
+ void *index_base = p->index_base;
+ SHA_CTX ctx;
+ unsigned char sha1[20];
+ unsigned long pack_size = p->pack_size;
+ void *pack_base;
+ struct pack_header *hdr;
+ int nr_objects;
+
+ hdr = p->pack_base;
+ if (hdr->hdr_signature != htonl(PACK_SIGNATURE))
+ return error("Packfile signature mismatch", p->pack_name);
+ if (hdr->hdr_version != htonl(PACK_VERSION))
+ return error("Packfile version %d different from ours %d",
+ ntohl(hdr->hdr_version), PACK_VERSION);
+ nr_objects = ntohl(hdr->hdr_entries);
+ if (num_packed_objects(p) != nr_objects)
+ return error("Packfile claims to have %d objects, "
+ "while idx size expects %d", nr_objects,
+ num_packed_objects(p));
+
+ SHA1_Init(&ctx);
+ pack_base = p->pack_base;
+ SHA1_Update(&ctx, pack_base, pack_size - 20);
+ SHA1_Final(sha1, &ctx);
+ if (memcmp(sha1, index_base + index_size - 40, 20))
+ return error("Packfile %s SHA1 mismatch with idx",
+ p->pack_name);
+ if (memcmp(sha1, pack_base + pack_size - 20, 20))
+ return error("Packfile %s SHA1 mismatch with itself",
+ p->pack_name);
+ return 0;
+}
+
+
+int verify_pack(struct packed_git *p)
+{
+ unsigned long index_size = p->index_size;
+ void *index_base = p->index_base;
+ SHA_CTX ctx;
+ unsigned char sha1[20];
+ int ret;
+
+ /* Verify SHA1 sum of the index file */
+ SHA1_Init(&ctx);
+ SHA1_Update(&ctx, index_base, index_size - 20);
+ SHA1_Final(sha1, &ctx);
+ if (memcmp(sha1, index_base + index_size - 20, 20))
+ return error("Packfile index for %s SHA1 mismatch",
+ p->pack_name);
+
+ /* Verify pack file */
+ use_packed_git(p);
+ ret = verify_packfile(p);
+ unuse_packed_git(p);
+ return ret;
+}
------------
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] Add git-verify-pack command.
2005-06-29 9:51 ` [PATCH] Add git-verify-pack command Junio C Hamano
@ 2005-06-29 16:15 ` Linus Torvalds
0 siblings, 0 replies; 39+ messages in thread
From: Linus Torvalds @ 2005-06-29 16:15 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
On Wed, 29 Jun 2005, Junio C Hamano wrote:
>
> *** Linus, what would be the practical/recommended/sane limit
> *** for mmap regions we would want to use? Currently I start
> *** throwing away mapped packs when the total size of mapped
> *** packs exceeds 64MB.
Hey, anybody who ever used BK had better have had 1GB of memory for any
real development. So 64MB is peanuts, but it sounds like a good guess.
> *** Oh, again, Linus, what is your preferred way to get a
> *** cover-letter material (like this) for a single patch?
I don't want cover-letters for single patches, or necessarily even for
short series (1-3 entries). The cover letter is more interesting for large
series, or even with small series if the early patches don't make much
sense on their own: then the cover-letter ends up being a useful place to
explain what the _sequence_ does, and why patch #2 that seems to be
totally useless and removes a feature is actually good ("we'll
re-implement it better in #5 after we've cleaned the code up").
So generally commentary after the three dashes is good, if the
commentary is "local", i.e. not related to a series. Only with non-local
explanations does a separate [0/N] thing make sense to me.
Linus
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: kernel.org and GIT tree rebuilding
@ 2005-07-03 2:51 linux
0 siblings, 0 replies; 39+ messages in thread
From: linux @ 2005-07-03 2:51 UTC (permalink / raw)
To: git, torvalds
> unsigned long size;
> unsigned char c;
>
> c = *pack++;
> size = c & 15;
> type = (c >> 4) & 7;
> while (c & 0x80) {
> c = *pack++;
> size = (size << 7) + (c & 0x7f);
> }
>
> or something. That's even denser.
If you're going for density, you missed something. Try:
c = *pack++;
size = c & 15;
type = (c >> 4) & 7;
while (c & 0x80) {
c = *pack++;
size = (size << 7) + (c & 0x7f) + 16;
}
Encoding is most easily done in little-endian order, such as:
static unsigned
encode(unsigned char *p, unsigned long x)
{
unsigned char *q = p;
unsigned char buf[5];
unsigned char *b = buf;
while (x > 15) {
assert(b < buf+5);
x -= 16;
*b++ = x & 0x7f;
x >>= 7;
}
*b = x;
while (b != buf)
*q++ = *b-- | 0x80;
*q++ = *b;
return (unsigned)(q - p);
}
(You'll probably want to rewrite the above, but it's abandoned to the
public domain in any case. Go nuts.)
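For what it's worth, the scheme round-trips; a small self-contained
harness (encode() copied from above, the decoder following the
adjusted "+ 16" loop quoted at the top; type bits are ignored here
since encode() leaves them zero):

	#include <assert.h>
	#include <stdio.h>

	static unsigned
	encode(unsigned char *p, unsigned long x)
	{
		unsigned char *q = p;
		unsigned char buf[5];
		unsigned char *b = buf;

		while (x > 15) {
			assert(b < buf+5);
			x -= 16;
			*b++ = x & 0x7f;
			x >>= 7;
		}
		*b = x;
		while (b != buf)
			*q++ = *b-- | 0x80;
		*q++ = *b;
		return (unsigned)(q - p);
	}

	static unsigned long
	decode(const unsigned char *p, unsigned *lenp)
	{
		const unsigned char *q = p;
		unsigned char c = *q++;
		unsigned long size = c & 15;

		while (c & 0x80) {
			c = *q++;
			size = (size << 7) + (c & 0x7f) + 16;
		}
		*lenp = (unsigned)(q - p);
		return size;
	}

	int main(void)
	{
		unsigned long sizes[] = { 0, 15, 16, 20, 2063, 1UL << 20, 0xfffffffUL };
		unsigned i;

		for (i = 0; i < sizeof(sizes)/sizeof(sizes[0]); i++) {
			unsigned char buf[8];
			unsigned n, m;
			unsigned long back;

			n = encode(buf, sizes[i]);
			back = decode(buf, &m);
			assert(n == m && back == sizes[i]);
		}
		printf("round-trip ok\n");
		return 0;
	}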
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Last mile for 1.0 again
2005-06-29 7:16 ` Last mile for 1.0 again Junio C Hamano
2005-06-29 9:51 ` [PATCH] Add git-verify-pack command Junio C Hamano
@ 2005-07-04 21:40 ` Daniel Barkalow
2005-07-04 21:45 ` Junio C Hamano
2005-07-04 21:59 ` Linus Torvalds
1 sibling, 2 replies; 39+ messages in thread
From: Daniel Barkalow @ 2005-07-04 21:40 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git, Linus Torvalds
On Wed, 29 Jun 2005, Junio C Hamano wrote:
> - Blame/Annotate. Does anybody have a fast and correct one
How about an option to git-rev-list to take a path, and (1) exclude any
branch where the version at that path ends up ignored in a merge and
(2) not list any revision where the version at that path is identical to a
parent?
This should give you the list of all commits which are directly
responsible for the present state of the file, which can then be formatted
as desired.
-Daniel
*This .sig left intentionally blank*
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Last mile for 1.0 again
2005-07-04 21:40 ` Last mile for 1.0 again Daniel Barkalow
@ 2005-07-04 21:45 ` Junio C Hamano
2005-07-04 21:59 ` Linus Torvalds
1 sibling, 0 replies; 39+ messages in thread
From: Junio C Hamano @ 2005-07-04 21:45 UTC (permalink / raw)
To: Daniel Barkalow; +Cc: git, Linus Torvalds
>>>>> "DB" == Daniel Barkalow <barkalow@iabervon.org> writes:
DB> On Wed, 29 Jun 2005, Junio C Hamano wrote:
>> - Blame/Annotate. Does anybody have a fast and correct one
DB> How about an option to git-rev-list to take a path, and (1) exclude any
DB> branch where the version at that path ends up ignored in a merge and
DB> (2) not list any revision where the version at that path is identical to a
DB> parent?
DB> This should give you the list of all commits which are directly
DB> responsible for the present state of the file, which can then be formatted
DB> as desired.
Sounds close enough if you do not care about copies and
complete rewrites.
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Last mile for 1.0 again
2005-07-04 21:40 ` Last mile for 1.0 again Daniel Barkalow
2005-07-04 21:45 ` Junio C Hamano
@ 2005-07-04 21:59 ` Linus Torvalds
2005-07-04 22:41 ` Daniel Barkalow
1 sibling, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2005-07-04 21:59 UTC (permalink / raw)
To: Daniel Barkalow; +Cc: Junio C Hamano, git
On Mon, 4 Jul 2005, Daniel Barkalow wrote:
>
> How about an option to git-rev-list to take a path, and (1) exclude any
> branch where the version at that path ends up ignored in a merge and
> (2) not list any revision where the version at that path is identical to a
> parent?
Hmm. How is that different from "git-whatchanged path", really?
Linus
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Last mile for 1.0 again
2005-07-04 21:59 ` Linus Torvalds
@ 2005-07-04 22:41 ` Daniel Barkalow
2005-07-04 23:06 ` Junio C Hamano
0 siblings, 1 reply; 39+ messages in thread
From: Daniel Barkalow @ 2005-07-04 22:41 UTC (permalink / raw)
To: Linus Torvalds, Junio C Hamano; +Cc: git
On Mon, 4 Jul 2005, Linus Torvalds wrote:
> On Mon, 4 Jul 2005, Daniel Barkalow wrote:
> >
> > How about an option to git-rev-list to take a path, and (1) exclude any
> > branch where the version at that path ends up ignored in a merge and
> > (2) not list any revision where the version at that path is identical to a
> > parent?
>
> Hmm. How is that different from "git-whatchanged path", really?
It would short-circuit going up areas of the history which don't
contribute (i.e., lead up to a merge which took its version from a
different parent). It could also stop when it ran out of branches that
have the file at all. Neither of these is all that significant, I guess.
Junio: what's missing from annotate/blame?
-Daniel
*This .sig left intentionally blank*
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Last mile for 1.0 again
2005-07-04 22:41 ` Daniel Barkalow
@ 2005-07-04 23:06 ` Junio C Hamano
2005-07-05 1:54 ` Daniel Barkalow
0 siblings, 1 reply; 39+ messages in thread
From: Junio C Hamano @ 2005-07-04 23:06 UTC (permalink / raw)
To: Daniel Barkalow; +Cc: Linus Torvalds, git
>>>>> "DB" == Daniel Barkalow <barkalow@iabervon.org> writes:
DB> Junio: what's missing from annotate/blame?
Which one are you talking about?
What I use to generate http://members.cox.net/junkio/Summary.txt
is an implementation of an algorithm I consider "complete" in
that it does rename/copy and complete rewrite correctly. What
is missing from the implementation is efficiency.
------------
#!/usr/bin/perl -w
use strict;
package main;
$::debug = 0;
sub read_blob {
my $sha1 = shift;
my $fh = undef;
my $result;
local ($/) = undef;
open $fh, '-|', 'git-cat-file', 'blob', $sha1
or die "cannot read blob $sha1";
$result = join('', <$fh>);
close $fh
or die "failure while closing pipe to git-cat-file";
return $result;
}
sub read_diff_raw {
my ($parent, $filename) = @_;
my $fh = undef;
local ($/) = "\0";
my @result = ();
my ($meta, $status, $sha1_1, $sha1_2, $file1, $file2);
print STDERR "* diff-cache --cached $parent $filename\n" if $::debug;
my $has_changes = 0;
open $fh, '-|', 'git-diff-cache', '--cached', '-z', $parent, $filename
or die "cannot read git-diff-cache $parent $filename";
while (defined ($meta = <$fh>)) {
$has_changes = 1;
}
close $fh
or die "failure while closing pipe to git-diff-cache";
if (!$has_changes) {
return ();
}
$fh = undef;
print STDERR "* diff-cache -B -C --find-copies-harder --cached $parent\n" if $::debug;
open($fh, '-|', 'git-diff-cache', '-B', '-C', '--find-copies-harder',
'--cached', '-z', $parent)
or die "cannot read git-diff-cache with $parent";
while (defined ($meta = <$fh>)) {
chomp($meta);
(undef, undef, $sha1_1, $sha1_2, $status) = split(/ /, $meta);
$file1 = <$fh>;
chomp($file1);
if ($status =~ /^[CR]/) {
$file2 = <$fh>;
chomp($file2);
} elsif ($status =~ /^D/) {
next;
} else {
$file2 = $file1;
}
if ($file2 eq $filename) {
push @result, [$status, $sha1_1, $sha1_2, $file1, $file2];
}
}
close $fh
or die "failure while closing pipe to git-diff-cache";
return @result;
}
sub write_temp_blob {
my ($sha1, $temp) = @_;
my $fh = undef;
my $blob = read_blob($sha1);
open $fh, '>', $temp
or die "cannot open temporary file $temp";
print $fh $blob;
close($fh);
}
package Git::Patch;
sub new {
my ($class, $sha1_1, $sha1_2) = @_;
my $self = bless [], $class;
my $fh = undef;
::write_temp_blob($sha1_1, "/tmp/blame-$$-1");
::write_temp_blob($sha1_2, "/tmp/blame-$$-2");
open $fh, '-|', 'diff', '-u0', "/tmp/blame-$$-1", "/tmp/blame-$$-2"
or die "cannot read diff";
while (<$fh>) {
if (/^\@\@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? \@\@/) {
push @$self, [$1, (defined $2 ? $2 : 1),
$3, (defined $4 ? $4 : 1)];
}
}
close $fh;
unlink "/tmp/blame-$$-1", "/tmp/blame-$$-2";
return $self;
}
sub find_parent_line {
my ($self, $commit_lineno) = @_;
my $ofs = 0;
for (@$self) {
my ($line_1, $len_1, $line_2, $len_2) = @$_;
if ($commit_lineno < $line_2) {
return $commit_lineno - $ofs;
}
if ($line_2 <= $commit_lineno && $commit_lineno < $line_2 + $len_2) {
return -1; # changed by commit.
}
$ofs += ($len_1 - $len_2);
}
return $commit_lineno + $ofs;
}
package Git::Commit;
my %author_name_canon =
('Linus Torvalds <torvalds@ppc970.osdl.org.(none)>' =>
'Linus Torvalds <torvalds@ppc970.osdl.org>');
sub canon_author_name {
my ($name) = @_;
if (exists $author_name_canon{$name}) {
return $author_name_canon{$name};
}
return $name;
}
sub new {
my $class = shift;
my $self = bless {
PARENT => [],
TREE => undef,
AUTHOR => undef,
COMMITTER => undef,
}, $class;
my $commit_sha1 = shift;
$self->{SHA1} = $commit_sha1;
my $fh = undef;
open $fh, '-|', 'git-cat-file', 'commit', $commit_sha1
or die "cannot read commit object $commit_sha1";
while (<$fh>) {
chomp;
if (/^tree ([0-9a-f]{40})$/) { $self->{TREE} = $1; }
elsif (/^parent ([0-9a-f]{40})$/) { push @{$self->{PARENT}}, $1; }
elsif (/^author ([^>]+>)/) {
$self->{AUTHOR} = canon_author_name($1);
}
elsif (/^committer ([^>]+>)/) {
$self->{COMMITTER} = canon_author_name($1);
}
}
close $fh
or die "failure while closing pipe to git-cat-file";
return $self;
}
sub find_file {
my ($commit, $path) = @_;
my $result = undef;
my $fh = undef;
local ($/) = "\0";
open $fh, '-|', 'git-ls-tree', '-z', '-r', '-d', $commit->{TREE}, $path
or die "cannot read git-ls-tree $commit->{TREE}";
while (<$fh>) {
chomp;
if (/^[0-7]{6} blob ([0-9a-f]{40}) (.*)$/) {
if ($2 ne $path) {
die "$2 ne $path???";
}
$result = $1;
last;
}
}
close $fh
or die "failure while closing pipe to git-ls-tree";
return $result;
}
package Git::Blame;
sub new {
my $class = shift;
my $self = bless {
LINE => [],
UNKNOWN => undef,
WORK => [],
}, $class;
my $commit = shift;
my $filename = shift;
my $sha1 = $commit->find_file($filename);
my $blob = ::read_blob($sha1);
my @blob = (split(/\n/, $blob));
for (my $i = 0; $i < @blob; $i++) {
$self->{LINE}[$i] = +{
COMMIT => $commit,
FOUND => undef,
FILENAME => $filename,
LINENO => ($i + 1),
};
}
$self->{UNKNOWN} = scalar @blob;
push @{$self->{WORK}}, [$commit, $filename];
return $self;
}
sub read_blame_cache {
my $self = shift;
my $filename = shift;
my $fh = undef;
my $pi = $self->{'PATHINFO'} = {};
open $fh, '<', $filename;
while (<$fh>) {
chomp;
my ($commit, $parent, $path) = split(/\t/, $_);
$pi->{$path}{$commit}{$parent} = 1;
}
close $fh;
}
sub print {
my $self = shift;
my $line_termination = shift;
for (my $i = 0; $i < @{$self->{LINE}}; $i++) {
my $l = $self->{LINE}[$i];
print($l->{FOUND} ? ':' : '?');
print "$l->{COMMIT}->{SHA1} ";
print "$l->{COMMIT}->{AUTHOR} ";
print "$l->{COMMIT}->{COMMITTER} ";
print "$l->{LINENO} $l->{FILENAME}";
print $line_termination;
}
}
sub take_responsibility {
my ($self, $commit) = @_;
for (my $i = 0; $i < @{$self->{LINE}}; $i++) {
my $l = $self->{LINE}[$i];
if (! $l->{FOUND} && ($l->{COMMIT}->{SHA1} eq $commit->{SHA1})) {
$l->{FOUND} = 1;
$self->{UNKNOWN}--;
}
}
}
sub blame_parent {
my ($self, $commit, $parent, $filename) = @_;
my @diff = ::read_diff_raw($parent->{SHA1}, $filename);
my $filename_in_parent;
my $passed_blame_to_parent = undef;
if (@diff == 0) {
# We have not touched anything. Blame parent for everything
# that we are suspected for.
for (my $i = 0; $i < @{$self->{LINE}}; $i++) {
my $l = $self->{LINE}[$i];
if (! $l->{FOUND} && ($l->{COMMIT}->{SHA1} eq $commit->{SHA1})) {
$l->{COMMIT} = $parent;
$passed_blame_to_parent = 1;
}
}
$filename_in_parent = $filename;
}
elsif (@diff != 1) {
# This should not happen.
for (@diff) {
print "** @$_\n";
}
die "Oops";
}
else {
my ($status, $sha1_1, $sha1_2, $file1, $file2) = @{$diff[0]};
print STDERR "** $status $file1 $file2\n" if $::debug;
if ($status =~ /N/ || $status =~ /M[0-9][0-9]/) {
# Either some of other parents created it, or we did.
# At this point the only thing we know is that this
# parent is not responsible for it.
;
}
else {
my $patch = Git::Patch->new($sha1_1, $sha1_2);
$filename_in_parent = $file1;
for (my $i = 0; $i < @{$self->{LINE}}; $i++) {
my $l = $self->{LINE}[$i];
if (! $l->{FOUND} && $l->{COMMIT}->{SHA1} eq $commit->{SHA1}) {
# We are suspected to have introduced this line.
# Does it exist in the parent?
my $lineno = $l->{LINENO};
my $parent_line = $patch->find_parent_line($lineno);
if ($parent_line < 0) {
# No, we may be the guilty ones, or some other
# parent might be. We do not assign blame to
# ourselves here yet.
;
}
else {
# This line is coming from the parent, so pass
# blame to it.
$l->{COMMIT} = $parent;
$l->{FILENAME} = $file1;
$l->{LINENO} = $parent_line;
$passed_blame_to_parent = 1;
}
}
}
}
}
if ($passed_blame_to_parent && $self->{UNKNOWN}) {
unshift @{$self->{WORK}},
[$parent, $filename_in_parent];
}
}
sub assign {
my ($self, $commit, $filename) = @_;
# We do read-tree of the current commit and diff-cache
# with each parents, instead of running diff-tree. This
# is because diff-tree does not look for copies hard enough.
if (exists $self->{'PATHINFO'} && exists $self->{'PATHINFO'}{$filename} &&
!exists $self->{'PATHINFO'}{$filename}{$commit->{SHA1}} &&
@{$commit->{PARENT}} == 1) {
# This commit did not touch the path at all, and
# has only one parent. It is all that parent's fault.
my $parent = Git::Commit->new($commit->{PARENT}[0]);
my $passed_blame_to_parent = 0;
for (my $i = 0; $i < @{$self->{LINE}}; $i++) {
my $l = $self->{LINE}[$i];
if (! $l->{FOUND} &&
($l->{COMMIT}->{SHA1} eq $commit->{SHA1})) {
$l->{COMMIT} = $parent;
$passed_blame_to_parent = 1;
}
}
if ($passed_blame_to_parent && $self->{UNKNOWN}) {
unshift @{$self->{WORK}},
[$parent, $filename];
}
return;
}
print STDERR "* read-tree $commit->{SHA1}\n" if $::debug;
system('git-read-tree', '-m', $commit->{SHA1});
for my $parent (@{$commit->{PARENT}}) {
$self->blame_parent($commit, Git::Commit->new($parent), $filename);
}
$self->take_responsibility($commit);
}
sub assign_blame {
my ($self) = @_;
while ($self->{UNKNOWN} && @{$self->{WORK}}) {
my $wk = shift @{$self->{WORK}};
my ($commit, $filename) = @$wk;
$self->assign($commit, $filename);
}
}
################################################################
package main;
my $usage = "blame [-z] <commit> filename";
my $line_termination = "\n";
$::ENV{GIT_INDEX_FILE} = "/tmp/blame-$$-index";
unlink($::ENV{GIT_INDEX_FILE});
if ($ARGV[0] eq '-z') {
$line_termination = "\0";
shift;
}
if (@ARGV != 2) {
die $usage;
}
my $head_commit = Git::Commit->new($ARGV[0]);
my $filename = $ARGV[1];
my $blame = Git::Blame->new($head_commit, $filename);
if (-f ".blame-cache") {
$blame->read_blame_cache(".blame-cache");
}
$blame->assign_blame();
$blame->print($line_termination);
unlink($::ENV{GIT_INDEX_FILE});
__END__
How does this work, and what do we do about merges?
The algorithm considers the first parent to be our main line of
development and treats it somewhat specially compared to other
parents. So we pass on the blame to the first parent if a line has
not changed from it. For lines that have changed from the first
parent, we must have either inherited that change from some other
parent, or it could have been a merge conflict resolution edit we
did on our own.
The following picture illustrates how we pass on and assign blame.
In the sample, the original O was forked into A and B and then merged
into M. Lines 1, 2, and 4 did not change. Lines 3 and 5 were changed
in A, and lines 5 and 6 were changed in B. M made its own decision to
resolve the merge conflict at line 5 to something different from A and B:
A: 1 2 T 4 T 6
/ \
O: 1 2 3 4 5 6 M: 1 2 T 4 M S
\ /
B: 1 2 3 4 S S
In the following picture, each line is annotated with a blame letter.
A lowercase blame (e.g. "a" for "1") means that the commit or one of
its ancestors is the guilty party, but we do not know which particular
ancestor is responsible for the change yet. An uppercase blame means
that we know that commit is the guilty party.
First we look at M (the HEAD) and initialize Git::Blame->{LINE} like
this:
M: 1 2 T 4 M S
m m m m m m
That is, we know all lines are results of modification made by some
ancestor of M, so we assign lowercase 'm' to all of them.
Then we examine our first parent A. Throughout the algorithm, we are
always only interested in the lines for which we are the suspect, but
this being the initial round, we are the suspect for all of them. We
notice that
1 2 T 4 are the same as the parent A, so we pass the blame for these
four lines to A. M and S are different from A, so we leave them as
they are (note that we do not immediately take the blame for them):
M: 1 2 T 4 M S
a a a a m m
Next we go on to examine parent B. Again, we are only interested in
the lines for which we are still the suspect (i.e. M and S). We notice S is
something we inherited from B, so we pass the blame on to it, like
this:
M: 1 2 T 4 M S
a a a a m b
Once we have exhausted the parents, we look at the results and take
responsibility for the remaining lines for which we are still the suspect:
M: 1 2 T 4 M S
a a a a M b
We are done with M. And we know commits A and B need to be examined
further, so we do them recursively. When we look at A, we again only
look at the lines for which A is the suspect:
A: 1 2 T 4 T 6
a a a a M b
Among 1 2 T 4, comparing against its parent O, we notice 1 2 4 are
the same, so we pass the blame for those lines to O:
A: 1 2 T 4 T 6
o o a o M b
A is a non-merge commit, so with that we have already exhausted its
parents, and A takes responsibility for the remaining lines for which
it is still the suspect:

A: 1 2 T 4 T 6
   o o A o M b
We go on like this, and the final result becomes:

O: 1 2 3 4 5 6
   O O A O M B
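
To restate the walkthrough as code, here is a minimal sketch of the
single-commit pass, separate from the script above; diff_against() is
a hypothetical helper, not a function in the script, that names the
lines unchanged between a commit and one parent:

sub assign_one {
    my ($commit, $lines, @parents) = @_;
    # Each $lines->[$i] is { suspect => $c, known => 0|1 }; known == 0
    # is a lowercase (tentative) blame, known == 1 an uppercase one.
    for my $parent (@parents) {
        # diff_against() is a hypothetical helper returning the
        # numbers of the lines in $commit's file that are unchanged
        # from $parent.
        my %same = map { $_ => 1 } diff_against($commit, $parent);
        for my $i (0 .. $#$lines) {
            my $l = $lines->[$i];
            next if $l->{known} || $l->{suspect} ne $commit;
            # Pass the tentative blame on if the parent had the line.
            $l->{suspect} = $parent if $same{$i};
        }
    }
    # Parents exhausted: take responsibility for the lines we are
    # still the suspect of.
    for my $l (@$lines) {
        $l->{known} = 1 if !$l->{known} && $l->{suspect} eq $commit;
    }
}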
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Last mile for 1.0 again
2005-07-04 23:06 ` Junio C Hamano
@ 2005-07-05 1:54 ` Daniel Barkalow
2005-07-05 6:24 ` Junio C Hamano
0 siblings, 1 reply; 39+ messages in thread
From: Daniel Barkalow @ 2005-07-05 1:54 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Linus Torvalds, git
On Mon, 4 Jul 2005, Junio C Hamano wrote:
> >>>>> "DB" == Daniel Barkalow <barkalow@iabervon.org> writes:
>
> DB> Junio: what's missing from annotate/blame?
>
> Which one are you talking about?
>
> What I use to generate http://members.cox.net/junkio/Summary.txt
> is an implementation of an algorithm I consider "complete" in
> that it does rename/copy and complete rewrite correctly. What
> is missing from the implementation is efficiency.
[perl script]
> How does this work, and what do we do about merges?
I've got that part, but I'm not clear on how the rename/copy and complete
rewrite stuff works.
-Daniel
*This .sig left intentionally blank*
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Last mile for 1.0 again
2005-07-05 1:54 ` Daniel Barkalow
@ 2005-07-05 6:24 ` Junio C Hamano
2005-07-05 13:34 ` Marco Costalba
0 siblings, 1 reply; 39+ messages in thread
From: Junio C Hamano @ 2005-07-05 6:24 UTC (permalink / raw)
To: Daniel Barkalow; +Cc: Linus Torvalds, git
>>>>> "DB" == Daniel Barkalow <barkalow@iabervon.org> writes:
DB> [perl script]
>> How does this work, and what do we do about merges?
DB> I've got that part, but I'm not clear on how the rename/copy and complete
DB> rewrite stuff works.
Rename/copy is simply about deciding which files to use when comparing
two trees.  If you are tracking file F and find that a commit created
that file by renaming what was originally G, then you compare the old
G and this F, assign blame for the common lines to the ancestor of
that commit (which had those lines in G), and take responsibility for
the rest yourself (or assign blame for them to other parents).
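
In code, the idea is just to pick the right path before diffing; a
minimal sketch, assuming a hypothetical diff_tree_renames() helper
that parses "git-diff-tree -M" output:

# Decide which path in $parent to compare our $filename against.
sub path_in_parent {
    my ($commit, $parent, $filename) = @_;
    # diff_tree_renames() is a hypothetical helper returning
    # { old_path, new_path } pairs for renames and copies between
    # the two commits.
    for my $r (diff_tree_renames($parent, $commit)) {
        return $r->{old_path} if $r->{new_path} eq $filename;
    }
    return $filename;    # no rename/copy: keep tracking the same path
}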
Rewrite is easier.  If diff-tree -B says it is a complete rewrite,
then instead of assigning blame for the lines you are currently
suspected of to any of your parents, you take the blame for all of
them yourself, without comparing the file with the parent to assign
blame line-by-line.
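
As a sketch (is_complete_rewrite() is an assumed wrapper around
"git-diff-tree -B" output, and take_responsibility() is the method
from the script posted earlier):

sub assign_with_rewrite_check {
    my ($self, $commit, $filename) = @_;
    # is_complete_rewrite() is a hypothetical helper that reports
    # whether diff-tree -B flags $filename as a total rewrite.
    if (is_complete_rewrite($commit, $filename)) {
        # Skip the per-parent, line-by-line comparison entirely and
        # take the blame for every line we are still suspected of.
        $self->take_responsibility($commit);
        return;
    }
    # ... otherwise do the normal per-parent pass as before.
}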
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Last mile for 1.0 again
2005-07-05 6:24 ` Junio C Hamano
@ 2005-07-05 13:34 ` Marco Costalba
0 siblings, 0 replies; 39+ messages in thread
From: Marco Costalba @ 2005-07-05 13:34 UTC (permalink / raw)
To: Junio C Hamano, Daniel Barkalow; +Cc: Linus Torvalds, git
--- Junio C Hamano <junkio@cox.net> wrote:
> >>>>> "DB" == Daniel Barkalow <barkalow@iabervon.org> writes:
>
> DB> [perl script]
>
> >> How does this work, and what do we do about merges?
>
Checking diffs against all the parents can be computationally
expensive.

I am developing a different algorithm for qgit, where I know, before
running annotate, all the revs that changed a given file.  The revs
are saved in the shaHist list according to git-rev-list --merge-order.
The algorithm operates from the oldest to the newest rev, calculating
the annotation for each following rev.

This algorithm takes advantage of the fact that most revs belong to 2
categories:

1) revs of the same development sequence (i.e. connected without
merges)

2) empty merges, i.e. merges where the file we are interested in is
listed in only one merging branch

If a rev in shaHist does not belong to the above categories, a reach
analysis must be performed to find the direct ancestor among all the
previous revs in the list.

So, starting from shaHist, a ReachList rl is populated by
getReachability():
void Annotate::getReachability(ReachList& rl, const QString& file,
                               const QStringList& shaHist) {
    rl.clear();
    QStringList::const_iterator it = shaHist.begin();
    Rev* r = revLookup(*it);
    rl.append(ReachInfo(*it, r->id, INITIAL));
    for (it++; it != shaHist.end(); it++) {
        Rev* r = revLookup(*it);

        // initial revision case
        if (r->parents.count() == 0) {
            rl.append(ReachInfo(*it, r->id, INITIAL));
            continue;
        }
        // merge case
        if (r->parents.count() > 1) {
            // check empty merges diffing against previous in list
            QString prevSha = rl.last().sha;
            QString d = getFileDiff(*it, prevSha, file);
            if (d.isEmpty()) {
                rl.append(ReachInfo(*it, r->id, EMPTY_MERGE));
                rl.last().roots.append(prevSha);
            } else {
                // real merge: we need reach test.
                QStringList roots;
                roots.clear();
                for (uint i = 0; i < r->parents.count(); i++) {
                    // here tree walking is performed
                    QString root = getRoot(r->parents[i], rl);
                    if (!root.isEmpty())
                        roots.append(root);
                }
                rl.append(ReachInfo(*it, r->id, MERGE));
                rl.last().roots = roots;
            }
            continue;
        }
        // normal case: a revision with a single parent
        if (r->id == rl.last().id) { // same development sequence
            QString prevSha = rl.last().sha;
            rl.append(ReachInfo(*it, r->id, SAME_BR)); // same branch
            rl.last().roots.append(prevSha);
        } else { // different development sequence
            QString root = getRoot(*it, rl); // <-- here tree walking is performed
            if (!root.isEmpty()) {
                rl.append(ReachInfo(*it, r->id, DIFF_BR));
                rl.last().roots.append(root);
            } else
                qDebug("ASSERT getReachability: rev %s not reachable",
                       (*it).latin1());
        }
    }
}
The ReachList rl is then used to compute the correct annotations:
void Annotate::run(const QString& file, const QStringList& shaHist,
                   AnnotateHistory& ah) {
    QString d, diffTarget;
    QStringList t;
    int pos;
    ReachList rl;

    ah.clear();
    getReachability(rl, file, shaHist); // <-- here ReachList rl is calculated

    for (uint i = 0; i < rl.count(); i++) {
        switch (rl[i].type) { // <-- here ReachList rl is used to find annotations
        case SAME_BR:
        case DIFF_BR:
            diffTarget = rl[i].roots.first();
            d = getFileDiff(rl[i].sha, diffTarget, file);
            pos = shaHist.findIndex(diffTarget);
            ah.append(processDiff(d, ah[pos], getAuthor(rl[i].sha, shaHist)));
            break;
        case INITIAL:
            pos = shaHist.findIndex(rl[i].sha);
            ah.append(getFirstAnnotation(file, shaHist, pos));
            break;
        case EMPTY_MERGE:
            diffTarget = rl[i].roots.first();
            pos = shaHist.findIndex(diffTarget);
            t = ah[pos]; // copy annotation from previous
            ah.append(t);
            break;
        case MERGE:
            // append annotation calculated on first root
            diffTarget = rl[i].roots.first();
            d = getFileDiff(rl[i].sha, diffTarget, file);
            pos = shaHist.findIndex(diffTarget);
            ah.append(processDiff(d, ah[pos], "Merge"));

            // update annotation with all the other roots
            for (uint j = 1; j < rl[i].roots.count(); j++) {
                diffTarget = rl[i].roots[j];
                d = getFileDiff(rl[i].sha, diffTarget, file);
                pos = shaHist.findIndex(diffTarget);

                // create an annotation calculated between root[j]
                // and the merge
                QStringList scndAnn = processDiff(d, ah[pos],
                                          getAuthor(diffTarget, shaHist));

                // finally we unify the two annotations, so we also
                // catch the case of a 3-way merge
                unify(ah.last(), scndAnn);
            }
            break;
        }
    }
}
The advantage of this algorithm is that the reach analysis, i.e. the
tree walking, is statistically performed only for a small subset of
the revs.  Another side benefit is that the correct annotation is
calculated for each rev, not only for the newest one.

Of course there are issues:

1) An ordered list of revs (the file history) must be known in
advance.  This is not a problem for qgit, because the file history is
calculated while loading the revs.

2) It does not take file renames into account.

I hope these notes can help the ongoing discussion.

Marco
^ permalink raw reply [flat|nested] 39+ messages in thread
end of thread, other threads:[~2005-07-05 13:35 UTC | newest]
Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-06-25 4:20 kernel.org and GIT tree rebuilding David S. Miller
2005-06-25 4:40 ` Jeff Garzik
2005-06-25 5:23 ` Linus Torvalds
2005-06-25 5:48 ` Jeff Garzik
2005-06-25 6:16 ` Linus Torvalds
2005-06-26 16:41 ` Linus Torvalds
2005-06-26 18:39 ` Junio C Hamano
2005-06-26 19:19 ` Linus Torvalds
2005-06-26 19:45 ` Junio C Hamano
[not found] ` <7v1x6om6o5.fsf@assigned-by-dhcp.cox.net>
[not found] ` <Pine.LNX.4.58.0506271227160.19755@ppc970.osdl.org>
[not found] ` <7v64vzyqyw.fsf_-_@assigned-by-dhcp.cox.net>
2005-06-28 6:56 ` [PATCH] Obtain sha1_file_info() for deltified pack entry properly Junio C Hamano
2005-06-28 6:58 ` Junio C Hamano
2005-06-28 6:58 ` [PATCH 2/3] git-cat-file: use sha1_object_info() on '-t' Junio C Hamano
2005-06-28 6:59 ` [PATCH 3/3] git-cat-file: '-s' to find out object size Junio C Hamano
2005-06-26 20:52 ` kernel.org and GIT tree rebuilding Chris Mason
2005-06-26 21:03 ` Chris Mason
2005-06-26 21:40 ` Linus Torvalds
2005-06-26 22:34 ` Linus Torvalds
2005-06-28 18:06 ` Nicolas Pitre
2005-06-28 19:28 ` Linus Torvalds
2005-06-28 21:08 ` Nicolas Pitre
2005-06-28 21:27 ` Linus Torvalds
2005-06-28 21:55 ` [PATCH] Bugfix: initialize pack_base to NULL Junio C Hamano
2005-06-29 3:55 ` kernel.org and GIT tree rebuilding Nicolas Pitre
2005-06-29 5:16 ` Nicolas Pitre
2005-06-29 5:43 ` Linus Torvalds
2005-06-29 5:54 ` Linus Torvalds
2005-06-29 7:16 ` Last mile for 1.0 again Junio C Hamano
2005-06-29 9:51 ` [PATCH] Add git-verify-pack command Junio C Hamano
2005-06-29 16:15 ` Linus Torvalds
2005-07-04 21:40 ` Last mile for 1.0 again Daniel Barkalow
2005-07-04 21:45 ` Junio C Hamano
2005-07-04 21:59 ` Linus Torvalds
2005-07-04 22:41 ` Daniel Barkalow
2005-07-04 23:06 ` Junio C Hamano
2005-07-05 1:54 ` Daniel Barkalow
2005-07-05 6:24 ` Junio C Hamano
2005-07-05 13:34 ` Marco Costalba
2005-06-25 5:04 ` kernel.org and GIT tree rebuilding Junio C Hamano
-- strict thread matches above, loose matches on Subject: below --
2005-07-03 2:51 linux
This is a public inbox; see mirroring instructions
for how to clone and mirror all data and code used for this inbox,
as well as URLs for NNTP newsgroup(s).