* kernel.org and GIT tree rebuilding
From: David S. Miller @ 2005-06-25  4:20 UTC
To: git

To get a clean history to push to Linus, I typically blow away my
trees and make fresh ones, into which I stick the patches I want to
merge.

That mostly works fine here on my local systems, but I know this
brings the master.kernel.org mirroring system to its knees.  So what
is the generally condoned way to do stuff like this in a more
friendly way?

Should I:

1) Do a git pull from Linus's tree once he takes my changes, then
   ask GIT to prune the tree?  How do I do that, and how does it work?

2) Use .git/objects/ database symlinking?

   Are there any scripts out there which do this automatically?
   Something as simple to run as "git-pull-script" that takes care
   of using links when possible on a local filesystem.

It sometimes takes an hour for my tree updates on master.kernel.org
to propagate to rsync.kernel.org so I can ask Linus to pull.  That's
crazy.
* Re: kernel.org and GIT tree rebuilding
From: Jeff Garzik @ 2005-06-25  4:40 UTC
To: David S. Miller; +Cc: git

David S. Miller wrote:
> To get a clean history to push to Linus, I typically blow away my
> trees and make fresh ones, into which I stick the patches I want to
> merge.
>
> That mostly works fine here on my local systems, but I know this
> brings the master.kernel.org mirroring system to its knees.  So
> what is the generally condoned way to do stuff like this in a more
> friendly way?
>
> Should I:
>
> 1) Do a git pull from Linus's tree once he takes my changes, then
>    ask GIT to prune the tree?  How do I do that, and how does it
>    work?

> It sometimes takes an hour for my tree updates on master.kernel.org
> to propagate to rsync.kernel.org so I can ask Linus to pull.
> That's crazy.

Unfortunately you cannot fix this by changing your actions.  This is
the cumulative effect of all the git kernel trees on kernel.org.  It
now takes over an hour for my non-git changes to propagate from
master to the mirrors as well.

This is all due to the rsync sweeps, which have to scan metric tons
of inodes and dentries: orders of magnitude more than in the pre-git
days.

The ftpadmin@kernel.org folks are supposedly working on an
inotify-based system and an improved rsync application.  No ETA or
details.

As an aside, cold-cache git really punishes my disks.  Ted Ts'o noted
that it really drains laptop batteries, too.

> 2) Use .git/objects/ database symlinking?
>
>    Are there any scripts out there which do this automatically?
>    Something as simple to run as "git-pull-script" that takes care
>    of using links when possible on a local filesystem.

On both kernel.org and locally, I use 'cp -al' to duplicate the
initial .git/objects directory, and then rsync (->kernel.org) or
git-pull-script (<-kernel.org) to update it after that.  That
definitely helps.

Maybe somebody needs to script a relink cron job for kernel.org?

	Jeff
* Re: kernel.org and GIT tree rebuilding
From: Linus Torvalds @ 2005-06-25  5:23 UTC
To: Jeff Garzik; +Cc: David S. Miller, git

On Sat, 25 Jun 2005, Jeff Garzik wrote:
>
> This is all due to the rsync sweeps, which have to scan metric tons
> of inodes and dentries: orders of magnitude more than in the
> pre-git days.

Well, the real solution is to use a git-aware protocol, not rsync.

rsync is wonderful for prototyping, and I wanted to make the database
rsync'able for that reason, but it clearly doesn't scale.

I think I'll make a "pack"/"unpack" pair that just packs all the
necessary objects between two commits.  Then you can basically sync
the object database by doing

	git-pack OLD..NEW | ssh other-end git-unpack

and you'd basically be done.  It looks pretty easy to do, too..

		Linus
* Re: kernel.org and GIT tree rebuilding
From: Jeff Garzik @ 2005-06-25  5:48 UTC
To: Linus Torvalds; +Cc: David S. Miller, git

Linus Torvalds wrote:
> On Sat, 25 Jun 2005, Jeff Garzik wrote:
>
>> This is all due to the rsync sweeps, which have to scan metric
>> tons of inodes and dentries: orders of magnitude more than in the
>> pre-git days.
>
> Well, the real solution is to use a git-aware protocol, not rsync.
>
> rsync is wonderful for prototyping, and I wanted to make the
> database rsync'able for that reason, but it clearly doesn't scale.
>
> I think I'll make a "pack"/"unpack" pair that just packs all the
> necessary objects between two commits.  Then you can basically sync
> the object database by doing
>
>	git-pack OLD..NEW | ssh other-end git-unpack
>
> and you'd basically be done.  It looks pretty easy to do, too..

The problem is kernel.org mirroring, not individual pushes and pulls,
really.

Would git-pack be the best solution for mirroring a bunch of git
trees?

	Jeff
* Re: kernel.org and GIT tree rebuilding
From: Linus Torvalds @ 2005-06-25  6:16 UTC
To: Jeff Garzik; +Cc: David S. Miller, git

On Sat, 25 Jun 2005, Jeff Garzik wrote:
>
> The problem is kernel.org mirroring, not individual pushes and
> pulls, really.
>
> Would git-pack be the best solution for mirroring a bunch of git
> trees?

No guarantees, but here's a rough plan:

 - I just committed a fairly trivial change to add a "--objects" flag
   to git-rev-list, which allows you to basically say "I want to see
   the difference not just in commit IDs, but also in trees and
   blobs".

What does that mean?  It means that in a mirroring scenario you can,
for each git tree, do:

	(a) On the slave:

		cat .git/refs/*/* | sort | uniq > slave-ref-list

	(b) On the master:

		cat .git/refs/*/* | sort | uniq > master-ref-list

	(c) On the master:

		cmp master-ref-list slave-ref-list && exit 1  # already in sync
		list=$(cat master-ref-list)
		for i in $(cat slave-ref-list)
		do
			list="$list ^$i"
		done
		git-rev-list --objects $list

and now that "git-rev-list" will list every object that needs to be
copied from the master to the slave.  No need to read huge
directories etc; you get the list computed for you.

Yeah, it clearly needs some refining to be useful, but I think you
can kind of see how it would work.

Now, the secondary advantage of this is that once you don't use rsync
as the mirroring method, you can change the filesystem object
database layout.  In particular, the packing thing that Chris Mason
was working on at some point suddenly becomes a lot more viable.

(In fact, more than that.  You can make a single packed blob for all
"historical" objects, and that also gives you an efficient archive
format - if you're not required to have the full filesystem layout,
you could have a much more efficient packing that you basically redo
once a week or something, so that you only keep the last week in the
regular "one file per object" format.)

		Linus
* Re: kernel.org and GIT tree rebuilding
From: Linus Torvalds @ 2005-06-26 16:41 UTC
To: Jeff Garzik; +Cc: David S. Miller, Git Mailing List, Nicolas Pitre, Chris Mason

On Fri, 24 Jun 2005, Linus Torvalds wrote:
>
> Yeah, it clearly needs some refining to be useful, but I think you
> can kind of see how it would work.

Ok, here's how it works.

 - Pick a starting commit (or a hundred)

 - Pick an ending commit (or a hundred)

 - generate the list of objects in between them:

	git-rev-list --objects end ^start > object-list

 - Pack that list of objects into an "object pack":

	git-pack-objects out < object-list

   (This actually generates two files: "out.idx" is the index file,
   "out.pack" is the data file, but I'll make it concatenate the two
   at some point.)

 - move the pack-files over somewhere else

 - unpack them:

	git-unpack-objects out

and you're done.

Now, the reason I use "pack" and "unpack" instead of just "tar" to
transport the objects is that this allows me to do fairly efficient
packing.  I wanted these pack-files to be independent (ie they do
_not_ depend on any objects outside of the pack-file), but within the
objects described in the pack I can do delta-compression.

Now, that doesn't help much for small updates (where the objects are
just unrelated and have no deltas), but it helps increasingly for big
ones, the biggest one obviously being the whole path from the start
to the HEAD..

For example, "du -sh .git/objects" for the git project itself is 17MB
for me, and I can do:

	torvalds@ppc970:~/git> du -sh .git/objects
	17M     .git/objects
	torvalds@ppc970:~/git> time git-rev-list --objects HEAD | git-pack-objects out
	Packing 3656 objects

	real    0m3.779s
	user    0m3.169s
	sys     0m0.602s
	torvalds@ppc970:~/git> ls -lh out.*
	-rw-rw-r--  1 torvalds torvalds  87K Jun 26 09:12 out.idx
	-rw-rw-r--  1 torvalds torvalds 2.0M Jun 26 09:12 out.pack

ie it packs down to a nice 2MB pack-file with a small index.  Move
that over to somewhere else and unpack it, and you'll get all the
regular objects (it doesn't move tags and refs over; you'll have to
do that outside of the packing).

Now, you can trade off some packing time to get a better pack:

	torvalds@ppc970:~/git> time git-rev-list --objects HEAD | git-pack-objects --window=100 out
	Packing 3656 objects

	real    0m11.953s
	user    0m11.294s
	sys     0m0.663s
	torvalds@ppc970:~/git> ls -lh out.*
	-rw-rw-r--  1 torvalds torvalds  87K Jun 26 09:14 out.idx
	-rw-rw-r--  1 torvalds torvalds 1.6M Jun 26 09:14 out.pack

and if you want to allow deep delta chains (the default delta depth
limit is 10), you can get even better results:

	torvalds@ppc970:~/git> time git-rev-list --objects HEAD | git-pack-objects --window=100 --depth=100 out
	Packing 3656 objects

	real    0m12.374s
	user    0m11.704s
	sys     0m0.659s
	torvalds@ppc970:~/git> ls -lh out.*
	-rw-rw-r--  1 torvalds torvalds  87K Jun 26 09:16 out.idx
	-rw-rw-r--  1 torvalds torvalds 1.3M Jun 26 09:16 out.pack

but then unpacking will be slightly heavier.

(Doing the same for the kernel is obviously much more expensive just
because the kernel is so much bigger.  A big delta discovery window
like 100 takes about fifteen minutes to pack on my machine, but gets
the current kernel archive down to 70MB or so.  That's ok for a
monthly "pack all the objects" to keep size requirements down, but
you clearly don't want to do this all the time ;)
Now, perhaps the more interesting part is that I also designed the
pack format so that it should be a good "history" format, not just a
way to move objects from one place to the other.  Ie if you worry
about diskspace, you can pack everything up to now into one big pack,
and then remove the original objects.

(Don't do that yet, btw - I haven't actually written the code to read
stuff out of packs if we don't find it in the object directory, but
the layout is such that it should be straightforward and pretty
efficient.  A deep delta chain obviously _will_ cause a performance
hit, though.)

I actually like this approach better than having delta-objects in the
filesystem.  Partly because the pack-file is self-contained, partly
because it also solves the fs blocking issue, yet it is still
efficient to look up the results without having hardlinks etc to
duplicate objects virtually.  And when you do the packing by hand as
an "archival" mechanism, it also doesn't have any of the downsides
that Chris' packing approach had.

Nico?  Chris?  Interested in giving it a look?  It's kind of a
combination of your things, generalized and then made to have fast
lookup with the index.

Fast lookup doesn't matter for a normal unpack, of course, and if I
just always wanted to unpack all the objects (ie just an object
transfer mechanism) I'd have made the index be a toposort of the
objects.  But because I wanted to be able to use it as an archival
format, I needed it to be "random-access" by object name.

So the index is in fact a binary tree (well, a sorted array, so the
lookup degenerates into a binary search) with a top-level index
splitting up the contents based on the first byte (the same way the
filesystem layout does).

		Linus
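[ To make the index structure concrete: a lookup over such a layout
  boils down to a fanout-narrowed binary search.  The sketch below is
  illustrative only; the struct layout, field names, and fanout
  semantics are assumptions based on the description above, not the
  actual on-disk format.

	#include <string.h>

	/* Hypothetical in-memory view of the index: a 256-entry fanout
	 * of cumulative counts (fanout[b] = number of entries whose
	 * first sha1 byte is <= b), followed by records sorted by sha1.
	 */
	struct idx_entry {
		unsigned int offset;	/* object's offset in the .pack */
		unsigned char sha1[20];
	};

	static int find_in_index(const unsigned int *fanout,
				 const struct idx_entry *entries,
				 const unsigned char *sha1,
				 unsigned long *offsetp)
	{
		/* The first byte narrows the search the same way the
		 * objects/xx/ filesystem layout does. */
		unsigned int lo = sha1[0] ? fanout[sha1[0] - 1] : 0;
		unsigned int hi = fanout[sha1[0]];

		while (lo < hi) {
			unsigned int mi = lo + (hi - lo) / 2;
			int cmp = memcmp(entries[mi].sha1, sha1, 20);

			if (!cmp) {
				*offsetp = entries[mi].offset;
				return 0;
			}
			if (cmp < 0)
				lo = mi + 1;
			else
				hi = mi;
		}
		return -1;	/* not in this pack */
	}

  The fanout costs one table lookup and cuts the binary search down
  by eight steps, which is why the filesystem layout used the same
  first-byte split. ]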
* Re: kernel.org and GIT tree rebuilding
From: Junio C Hamano @ 2005-06-26 18:39 UTC
To: Linus Torvalds; +Cc: David S. Miller, Git Mailing List, Nicolas Pitre, Chris Mason

>>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes:

LT> I actually like this approach better than having delta-objects in
LT> the filesystem.  Partly because the pack-file is self-contained,
LT> partly because it also solves the fs blocking issue, yet it is
LT> still efficient to look up the results without having hardlinks
LT> etc to duplicate objects virtually.  And when you do the packing
LT> by hand as an "archival" mechanism, it also doesn't have any of
LT> the downsides that Chris' packing approach had.

After analyzing what is involved in integrating packed GIT into
read_sha1_file() [*1*], I agree 100% with the above.  I mean no
disrespect to what Nico has done (and I myself have written some code
to work with Nico's deltified objects when I did the diff and pull
fixes), but it would help the code very much if we did not have to
worry about "delta" objects in GIT_OBJECT_DIRECTORY.

My preference is to do things in this order:

 (0) concatenate pack and idx files;

 (1) teach read_sha1_file() to read from packed GIT;

 (2) teach fsck-cache about packed GIT;

 (3) have people with deltified repositories convert them back to
     undeltified (I think git-pack-objects would barf on such a
     repository);

 (4) drop "delta" objects from GIT_OBJECT_DIRECTORY; this means that
     git-deltafy-script and git-mkdelta have to go;

 (5) tell git-*-pull about packed GIT.

[Footnotes]

*1* Here is the analysis I did last night, still assuming that we
would support "delta" objects in GIT_OBJECT_DIRECTORY.  The
"trickier" map_sha1_file() users almost all involve "delta" objects,
and that is why I prefer dropping them.

 - Enhance the GIT_ALTERNATE_OBJECT_DIRECTORIES mechanism so that its
   components can be either a directory or a packed file.

 - sha1_file.c::find_sha1_file() has to be enhanced to return not
   just a path (in the current "individual object file" case) but a
   pointer to a structure that describes a packed file in the
   GIT_ALTERNATE_OBJECT_DIRECTORIES list, with the offset for the
   entry.

 - The change necessary to sha1_file.c::has_sha1_file() is minimal.
   find_sha1_file(), updated along the above lines, would say whether
   the thing exists or not anyway, so it can just return true/false
   as it currently does pretty easily.

 - sha1_file.c::read_sha1_file() would be the primary piece that
   unpacks from the packed representation.

 - sha1_file.c::map_sha1_file() is trickier.  It has a handful of
   callers outside sha1_file.c, for valid reasons, so we will need to
   audit the callers and have them fall back on read_sha1_file() as
   appropriate.  Here is the result of my first pass:

   - (easy) sha1_delta_base() is used only when an object is
     deltified, and if so, to get to the base object.  We can just
     tell the caller our object is not deltified when it resides in a
     packed file.

   - (easy) sha1_file_size() is used by diffcore to measure the
     expanded blob size.  Although the implementation obviously has
     to be different, it would be trivial to find the size if the
     object resides in a packed file.

   - (easy) pack-objects.c::check_object() uses map_sha1_file() so
     that it can unpack just enough to get the type of the object.
     We should be able to introduce a new interface (say,
     sha1_file.c::sha1_object_info()) for doing this sort of thing.

   - (harder) mkdelta.c::get_buffer(), object.c::parse_object() and
     delta.c::process_delta() are trickier, because they want to
     treat "delta" as a raw object (otherwise we would have just done
     read_sha1_file() instead of the map/unpack_sha1_file() pair).

   - (harder) ssh-push.c::serve_object() also wants the raw
     representation, to ship it directly to the other end.
* Re: kernel.org and GIT tree rebuilding
From: Linus Torvalds @ 2005-06-26 19:19 UTC
To: Junio C Hamano; +Cc: David S. Miller, Git Mailing List, Nicolas Pitre, Chris Mason

On Sun, 26 Jun 2005, Junio C Hamano wrote:
>
> My preference is to do things in this order:
>
>  (0) concatenate pack and idx files;

Actually, I was originally planning to do that, but now that I have
thought about what read_sha1_file() would actually do, I think it's
more efficient to leave the index as a separate file.

In particular, what you'd normally do is that if you can't look up
the file in the regular object directory, you start going through the
pack files.  You can do it by having GIT_ALTERNATE_OBJECT_DIRECTORIES
point to a pack file, but I actually would prefer the notion of just
adding a .git/objects/pack subdirectory, and having object lookup
just automatically open and map all index files in that subdirectory.

And the thing is, you really just want to map the index files; the
data files can be so big that you can't afford to map them (ie a
really big project might have several pack-files a gig each or
something like that).

And the most efficient way to map just the index file is to keep it
separate, because then the "stat()" will just get the information
directly, and you then just mmap that.  The alternative is to first
read the index of the index (to figure out how big the index is), and
then map the rest.  But that just seems a lot messier than mapping
the index file directly.

And when creating these things, we do need to create the data file
(which can be big enough that it doesn't fit in memory) first, so we
have to have a separate file for it; we can't just stream it out to
stdout.

Now, when _sending_ the pack-files, linearizing them is easy: you
just send the index first, and the data file immediately afterwards.
The index tells how big it is, so there's no need to even add any
markers: you can do something like 'git-send-script' with something
simple like

	git-rev-list ... |
		git-pack-file tmp-pack &&
	cat tmp-pack.idx tmp-pack.data |
		ssh other git-receive-script

So let's just keep the index/data files separate.

		Linus
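[ As an aside on the stat-then-mmap point: mapping a standalone index
  really is a two-call affair.  A minimal sketch, with error handling
  trimmed and the name map_index_file() being illustrative rather
  than an actual git function:

	#include <fcntl.h>
	#include <sys/mman.h>
	#include <sys/stat.h>
	#include <unistd.h>

	/* Map a whole pack index in one go: fstat() tells us the size
	 * up front, so there is no "index of the index" to read first.
	 * The big data file is left alone until an object in it is
	 * actually needed. */
	static void *map_index_file(const char *path, unsigned long *sizep)
	{
		struct stat st;
		void *map;
		int fd = open(path, O_RDONLY);

		if (fd < 0)
			return NULL;
		if (fstat(fd, &st) < 0) {
			close(fd);
			return NULL;
		}
		*sizep = st.st_size;
		map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
		close(fd);
		return map == MAP_FAILED ? NULL : map;
	}

  With a concatenated index+data file, the same code would first have
  to read a header to learn where the index ends, which is the
  messiness being avoided. ]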
* Re: kernel.org and GIT tree rebuilding
From: Junio C Hamano @ 2005-06-26 19:45 UTC
To: Linus Torvalds; +Cc: David S. Miller, Git Mailing List, Nicolas Pitre, Chris Mason

>>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes:

LT> On Sun, 26 Jun 2005, Junio C Hamano wrote:
>>
>> My preference is to do things in this order:
>>
>>  (0) concatenate pack and idx files;

LT> So let's just keep the index/data files separate.

Fair enough.  Having thought about it a bit more, if people agree, I
think it would make more sense to rip out "delta" object support
first, before doing the read_sha1_file() and friends that use
.git/objects/pack.  My "preferred order" now looks like this:

 (1) have people with deltified repositories convert them back to
     undeltified (I think git-pack-objects would barf on such a
     repository);

 (2) drop "delta" objects from GIT_OBJECT_DIRECTORY; this means that
     git-deltafy-script and git-mkdelta have to go;

 (3) teach read_sha1_file() to read from packed GIT in
     .git/objects/pack;

 (4) teach fsck-cache about packed GIT;

 (5) tell git-*-pull about packed GIT.
* [PATCH] Obtain sha1_file_info() for deltified pack entry properly.
From: Junio C Hamano @ 2005-06-28  6:56 UTC
To: Linus Torvalds; +Cc: git

I will be sending these three patches:

  [PATCH 1/3] Obtain sha1_file_info() for deltified pack entry properly.
  [PATCH 2/3] git-cat-file: use sha1_object_info() on '-t'.
  [PATCH 3/3] git-cat-file: '-s' to find out object size.

The first one is slightly different from what I sent you privately
earlier.  If you have already applied that one, please apply the
4-liner alternate patch attached to this message on top of it; it
contains the fix included in the first patch of this series (which
you should then drop, obviously).

The second and third patches fell out as a bonus while I was
debugging sha1_file_info().  The third one especially is in the
"because we can do it so cheaply now" category rather than "because I
need to have that feature", and I do not mind too much if you drop
it, but I suspect somebody may find it useful.

The "4-liner alternate patch" follows.

------------
Add missing use_packed_git() call.

The function sha1_object_info() was using a packed GIT file without
making sure it was mapped, which resulted in segfaulting.  We would
need to introduce an unuse_packed_git() call and do proper use
counting to figure out when it is safe to unmap, but currently we do
not unmap the packed file yet.

Signed-off-by: Junio C Hamano <junkio@cox.net>
---
 sha1_file.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -675,6 +675,10 @@ static int packed_object_info(struct pac
 	offset = entry->offset;
 	if (p->pack_size - 5 < offset)
 		die("object offset outside of pack file");
+
+	if (use_packed_git(p))
+		die("cannot map packed file");
+
 	pack = p->pack_base + offset;
 	size = (pack[1] << 24) + (pack[2] << 16) + (pack[3] << 8) + pack[4];
 	left = p->pack_size - offset - 5;
------------------------------------------------
* Re: [PATCH] Obtain sha1_file_info() for deltified pack entry properly.
From: Junio C Hamano @ 2005-06-28  6:58 UTC
To: Linus Torvalds; +Cc: git

[PATCH 1/3] Obtain sha1_file_info() for deltified pack entry properly.

The initial one was not doing enough to figure things out without
uncompressing too much.  This also fixes a potential segfault
resulting from a missing use_packed_git() call.

We would need to introduce an unuse_packed_git() call and do proper
use counting to figure out when it is safe to unmap, but currently we
do not unmap the packed file yet.

Signed-off-by: Junio C Hamano <junkio@cox.net>
---
 sha1_file.c |   73 ++++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 files changed, 69 insertions(+), 4 deletions(-)

d2f58b4aef500835489f30ac5df7985bc21e3c24
diff --git a/sha1_file.c b/sha1_file.c
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -601,9 +601,70 @@ void * unpack_sha1_file(void *map, unsig
 	return unpack_sha1_rest(&stream, hdr, *size);
 }
 
-/* Returns 0 on fast-path success, returns 1 on deltified
- * and need to unpack to see info.
- */
+static int packed_delta_info(unsigned char *base_sha1,
+			     unsigned long delta_size,
+			     unsigned long left,
+			     char *type,
+			     unsigned long *sizep)
+{
+	unsigned char *data;
+	unsigned char delta_head[64];
+	int i;
+	unsigned char cmd;
+	unsigned long data_size, result_size, base_size, verify_base_size;
+	z_stream stream;
+	int st;
+
+	if (left < 20)
+		die("truncated pack file");
+	if (sha1_object_info(base_sha1, type, &base_size))
+		die("cannot get info for delta-pack base");
+
+	data = base_sha1 + 20;
+	data_size = left - 20;
+
+	memset(&stream, 0, sizeof(stream));
+
+	stream.next_in = data;
+	stream.avail_in = data_size;
+	stream.next_out = delta_head;
+	stream.avail_out = sizeof(delta_head);
+
+	inflateInit(&stream);
+	st = inflate(&stream, Z_FINISH);
+	inflateEnd(&stream);
+	if ((st != Z_STREAM_END) && stream.total_out != sizeof(delta_head))
+		die("delta data unpack-initial failed");
+
+	/* Examine the initial part of the delta to figure out
+	 * the result size.  Verify the base size while we are at it.
+	 */
+	data = delta_head;
+	verify_base_size = i = 0;
+	cmd = *data++;
+	while (cmd) {
+		if (cmd & 1)
+			verify_base_size |= *data++ << i;
+		i += 8;
+		cmd >>= 1;
+	}
+
+	/* Read the result size */
+	result_size = i = 0;
+	cmd = *data++;
+	while (cmd) {
+		if (cmd & 1)
+			result_size |= *data++ << i;
+		i += 8;
+		cmd >>= 1;
+	}
+	if (verify_base_size != base_size)
+		die("delta base size mismatch");
+
+	*sizep = result_size;
+	return 0;
+}
+
 static int packed_object_info(struct pack_entry *entry,
 			      char *type, unsigned long *sizep)
 {
@@ -614,12 +675,16 @@ static int packed_object_info(struct pac
 	offset = entry->offset;
 	if (p->pack_size - 5 < offset)
 		die("object offset outside of pack file");
+
+	if (use_packed_git(p))
+		die("cannot map packed file");
+
 	pack = p->pack_base + offset;
 	size = (pack[1] << 24) + (pack[2] << 16) + (pack[3] << 8) + pack[4];
 	left = p->pack_size - offset - 5;
 	switch (*pack) {
 	case 'D':
-		return 1;
+		return packed_delta_info(pack+5, size, left, type, sizep);
 		break;
 	case 'C':
 		strcpy(type, "commit");
------------
* [PATCH 2/3] git-cat-file: use sha1_object_info() on '-t'.
From: Junio C Hamano @ 2005-06-28  6:58 UTC
To: Linus Torvalds; +Cc: git

When trying to find out the type of the object, there is no need to
uncompress the whole object.  Just use sha1_object_info().

Signed-off-by: Junio C Hamano <junkio@cox.net>
---
 cat-file.c |   10 ++++------
 1 files changed, 4 insertions(+), 6 deletions(-)

fcfdd3d74d359af32dd30aa9f4fe373cebb3e232
diff --git a/cat-file.c b/cat-file.c
--- a/cat-file.c
+++ b/cat-file.c
@@ -16,13 +16,11 @@ int main(int argc, char **argv)
 		usage("git-cat-file [-t | tagname] <sha1>");
 
 	if (!strcmp("-t", argv[1])) {
-		buf = read_sha1_file(sha1, type, &size);
-		if (buf) {
-			buf = type;
-			size = strlen(type);
-			type[size] = '\n';
-			size++;
+		if (!sha1_object_info(sha1, type, &size)) {
+			printf("%s\n", type);
+			return 0;
 		}
+		buf = NULL;
 	} else {
 		buf = read_object_with_reference(sha1, argv[1], &size, NULL);
 	}
------------
* [PATCH 3/3] git-cat-file: '-s' to find out object size.
From: Junio C Hamano @ 2005-06-28  6:59 UTC
To: Linus Torvalds; +Cc: git

We use sha1_object_info() now, and getting the size is also trivial.
I admit that this is more of a "because we can" than a "because I see
an immediate need for it" change, though.

Signed-off-by: Junio C Hamano <junkio@cox.net>
---
 Documentation/git-cat-file.txt |   12 +++++++++---
 cat-file.c                     |   13 ++++++++++---
 2 files changed, 19 insertions(+), 6 deletions(-)

dcf2fc7609509985d4411fb13c6956bcf3c8560f
diff --git a/Documentation/git-cat-file.txt b/Documentation/git-cat-file.txt
--- a/Documentation/git-cat-file.txt
+++ b/Documentation/git-cat-file.txt
@@ -9,12 +9,13 @@ git-cat-file - Provide content or type i
 
 SYNOPSIS
 --------
-'git-cat-file' (-t | <type>) <object>
+'git-cat-file' (-t | -s | <type>) <object>
 
 DESCRIPTION
 -----------
 Provides content or type of objects in the repository.  The type
-is required if '-t' is not being used to find the object type.
+is required unless '-t' is used to find the object type,
+or '-s' is used to find the object size.
 
 OPTIONS
 -------
@@ -25,6 +26,10 @@ OPTIONS
 	Instead of the content, show the object type identified by
 	<object>.
 
+-s::
+	Instead of the content, show the object size identified by
+	<object>.
+
 <type>::
 	Typically this matches the real type of <object> but asking
 	for a type that can trivially dereferenced from the given
@@ -35,7 +40,8 @@ OPTIONS
 
 OUTPUT
 ------
-If '-t' is specified, one of the <type>.
+If '-t' is specified, one of the <type>.  If '-s' is specified,
+the size of the <object> in bytes.
 
 Otherwise the raw (though uncompressed) contents of the <object>
 will be returned.
diff --git a/cat-file.c b/cat-file.c
--- a/cat-file.c
+++ b/cat-file.c
@@ -13,11 +13,18 @@ int main(int argc, char **argv)
 	unsigned long size;
 
 	if (argc != 3 || get_sha1(argv[2], sha1))
-		usage("git-cat-file [-t | tagname] <sha1>");
+		usage("git-cat-file [-t | -s | tagname] <sha1>");
 
-	if (!strcmp("-t", argv[1])) {
+	if (!strcmp("-t", argv[1]) || !strcmp("-s", argv[1])) {
 		if (!sha1_object_info(sha1, type, &size)) {
-			printf("%s\n", type);
+			switch (argv[1][1]) {
+			case 't':
+				printf("%s\n", type);
+				break;
+			case 's':
+				printf("%lu\n", size);
+				break;
+			}
 			return 0;
 		}
 		buf = NULL;
------------
* Re: kernel.org and GIT tree rebuilding
From: Chris Mason @ 2005-06-26 20:52 UTC
To: Linus Torvalds; +Cc: Jeff Garzik, David S. Miller, Git Mailing List, Nicolas Pitre

On Sunday 26 June 2005 12:41, Linus Torvalds wrote:
> On Fri, 24 Jun 2005, Linus Torvalds wrote:
> > Yeah, it clearly needs some refining to be useful, but I think you
> > can kind of see how it would work.
>
> Ok, here's how it works.
>
>  - Pick a starting commit (or a hundred)
>
>  - Pick an ending commit (or a hundred)
>
>  - generate the list of objects in between them:
>
>	git-rev-list --objects end ^start > object-list
>
>  - Pack that list of objects into an "object pack":
>
>	git-pack-objects out < object-list

Without having read the code, the big thing that hurt performance in
my early packed file work was compressing the whole packed file
instead of individual sub-objects.  It takes more room to compress
each object, but when I compressed the whole thing, read performance
was quite bad.

-chris
* Re: kernel.org and GIT tree rebuilding
From: Chris Mason @ 2005-06-26 21:03 UTC
To: Linus Torvalds; +Cc: Jeff Garzik, David S. Miller, Git Mailing List, Nicolas Pitre

On Sunday 26 June 2005 16:52, Chris Mason wrote:
> >	git-rev-list --objects end ^start > object-list
> >
> >  - Pack that list of objects into an "object pack":
> >
> >	git-pack-objects out < object-list
>
> Without having read the code, the big thing that hurt performance
> in my early packed file work was compressing the whole packed file
> instead of individual sub-objects.  It takes more room to compress
> each object, but when I compressed the whole thing, read
> performance was quite bad.

Sorry, fat fingered the send key...

The hard links were the biggest problem with my packed file patches;
I think the dynamic lookup in a separate packed file index is the
best way to go.

-chris
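[ For context, the tradeoff Chris describes is between one
  compression stream over the whole pack (a better ratio, but no
  random access, since you must inflate from the beginning) and one
  stream per object (any object can be inflated straight from its
  offset).  A hedged sketch of the per-object variant with zlib; the
  function name and the assumption that the output buffer is large
  enough (e.g. compressBound(size) bytes) are illustrative:

	#include <string.h>
	#include <zlib.h>

	/* Deflate one object as a self-contained zlib stream so it
	 * can later be inflated directly from its offset in the pack.
	 * Returns the number of compressed bytes written to "out". */
	static unsigned long compress_entry(void *data, unsigned long size,
					    void *out, unsigned long outlen)
	{
		z_stream stream;
		unsigned long n;

		memset(&stream, 0, sizeof(stream));
		deflateInit(&stream, Z_DEFAULT_COMPRESSION);
		stream.next_in = data;
		stream.avail_in = size;
		stream.next_out = out;
		stream.avail_out = outlen;
		while (deflate(&stream, Z_FINISH) == Z_OK)
			; /* keep flushing until the stream ends */
		n = stream.total_out;
		deflateEnd(&stream);
		return n;
	}

  Per-object streams cost a few bytes of zlib header per object, but
  that is exactly what preserves random access. ]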
* Re: kernel.org and GIT tree rebuilding
From: Linus Torvalds @ 2005-06-26 21:40 UTC
To: Chris Mason; +Cc: Jeff Garzik, David S. Miller, Git Mailing List, Nicolas Pitre

On Sun, 26 Jun 2005, Chris Mason wrote:
>
> Without having read the code, the big thing that hurt performance
> in my early packed file work was compressing the whole packed file
> instead of individual sub-objects.  It takes more room to compress
> each object, but when I compressed the whole thing, read
> performance was quite bad.

Since I wanted random access, compressing the whole thing just wasn't
an option.  Besides, the big space savings come from finding deltas,
which is obviously also a compression, just at a higher level.

The biggest problem there is to find a good guess of objects to try
to delta against, and right now that part is pretty stupid and could
possibly be improved (it just sorts objects by size and tries to
delta against "close" objects).

Generating a better sort _would_ actually be pretty close to doing a
global compression (it really does boil down to the same thing:
finding big sub-sequences, except in a "fragmented" space), but one
issue is that I don't want to read the whole data set in one go, so
it would have to be based on some rolling hash or something.  Davide
pointed to rzip, and a variation of that (one which knows about
object boundaries) might work.

(You can also sort by filename, if you want to try.  I don't track
filenames at all there, and it's actually non-trivial to do, so that
would require some new and pretty nasty code, but it's possible in
_theory_ at least.)

Anyway, that's all potential improvement for generating better
packing, and it should certainly be possible without changing the
format - just generate a better initial sort, in order to find more
deltas (or rather, find them faster by using a smaller window size).
So the stupid sort I have now does actually work, but exactly because
it's so stupid it wants a big window for best packing (because there
might be a lot of objects that aren't interesting), which in turn is
quite expensive.  A better sort would make a smaller window more
effective.

[ Some numbers: a window of 10 objects is the default, and packs the
  current kernel down to 77MB in 2m21s.  A window of 20 objects
  improves that packing to 71MB, but makes the packing time go up to
  3m36s for me.  And a window of 100 gets us down to 62M but takes
  11m54s.  A window of 200 (with a delta depth of 200 too - likely
  _way_ too deep for normal use) gives you a 59M pack, but takes
  20m59s, so there's definitely a point of diminishing returns.  This
  is all for the current HEAD, which takes up 264M the "traditional"
  git way and 141M without any deltas, just packed tightly with no
  filesystem blocking.

  Now, as you can see, that's actually a slightly sub-linear increase
  in time, because once we find a delta, we will only accept smaller
  deltas in the future, so we can often stop comparing even before
  we've reached the maximum window size, and so the effort is
  slightly less than linear because there's effectively a constant
  component to part of it.

  Also, the good news is that you probably don't want to generate one
  humungous pack archive anyway; you're likely better off doing a new
  incremental pack every few months.  So we'll never have the
  situation that creating a pack gets increasingly more costly, since
  at some point you just say "ok, I created a perfect pack for the
  first 4 months of development, I'll now do subsequent packs on top
  of that instead".

  The other good news is that a pack is also a natural boundary for
  fsck (as in "ok, I found that object in a pack, so I won't bother
  going deeper in the reachability chain"), so if you start packing
  your repository, fsck will only have to worry about the objects
  that are unpacked.  That makes packs work really naturally for
  archiving, ie this all means that you can avoid a lot of overhead
  by packing your history every once in a while, with it all being
  entirely transparent.

  In other words, if you just pack every month, you can basically
  guarantee that fsck costs etc never really go up, and your
  diskspace also goes up only very slowly.  The packed format is
  quite efficient in many ways, but it is totally immutable (ie you
  can't add anything to an archive - a pack stays the way it always
  was, and if you want to pack more you have to either re-do the pack
  or just create a new one). ]

		Linus
* Re: kernel.org and GIT tree rebuilding
From: Linus Torvalds @ 2005-06-26 22:34 UTC
To: Chris Mason; +Cc: Jeff Garzik, David S. Miller, Git Mailing List, Nicolas Pitre

On Sun, 26 Jun 2005, Linus Torvalds wrote:
>
> (You can also sort by filename, if you want to try.  I don't track
> filenames at all there, and it's actually non-trivial to do, so
> that would require some new and pretty nasty code, but it's
> possible in _theory_ at least.)

Heh.  It's actually very easy if you don't take the "name" too
seriously, and you just pick some random one, namely the first one
that was used to reach the entry.

Then you might sort the objects on a hash based on the name, and get
tons of cheap deltas close by.  It is _uglee_, but hey, it's a
heuristic, and it happens to work pretty well.  It brought the kernel
pack down to 59M even with just a small window of 10.

Thanks to Davide for making me think about this hack, although he
talked about something much more proper (and harder) than this quick
and ugly heuristic ;)

The main change is that "git-rev-list --objects" has been changed to
show the name the object was reached through.

		Linus
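[ The kind of hash this heuristic wants is anything that makes equal
  or similar names sort near each other.  The function below is an
  illustrative stand-in, not the hash that was actually committed: it
  hashes only the final path component, so two "Makefile" blobs from
  different directories land in the same neighborhood of the sort and
  inside the same small delta window:

	/* Illustrative name hash: only the basename contributes, so
	 * objects reached through paths ending in the same file name
	 * sort together. */
	static unsigned int name_hash(const char *name)
	{
		const char *p = name;
		unsigned int hash = 0;

		while (*p)
			p++;		/* find the end of the path */
		while (p != name && p[-1] != '/')
			p--;		/* back up to the basename */
		while (*p) {
			unsigned char c = *p++;
			hash = hash * 11 + c;
		}
		return hash;
	}

  Anything in this family works because the sort only has to put
  plausible delta candidates near each other; the window search does
  the rest. ]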
* Re: kernel.org and GIT tree rebuilding
From: Nicolas Pitre @ 2005-06-28 18:06 UTC
To: Linus Torvalds; +Cc: Git Mailing List, Chris Mason

On Sun, 26 Jun 2005, Linus Torvalds wrote:

> On Fri, 24 Jun 2005, Linus Torvalds wrote:
> >
> > Yeah, it clearly needs some refining to be useful, but I think
> > you can kind of see how it would work.
>
> Ok, here's how it works.
> [...]
>
> Nico?  Chris?  Interested in giving it a look?  It's kind of a
> combination of your things, generalized and then made to have fast
> lookup with the index.

Wow!  Look away for a few days and when you look back your own stuff
has been yanked out!

But actually I like this packing thing.  It is certainly much nicer
to work with, and also more useful.  And since the pack file has its
own checksum, it is possible to use that pack format (without the
index, which can be recreated from the pack file alone anyway) for
network transfer.

Here's one improvement to the pack format, breaking it early so it
won't affect anyone at this point: a compressed object header.
Instead of the fixed 5-byte header, this patch converts it to a
variable size: since most objects are small, we save on the storage
of the significant size bytes which will be zero, packing the
non-zero byte positions in with the object type.  This can save up to
2% or 21KB on the test I performed with the git repository.

diff --git a/pack-objects.c b/pack-objects.c
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -7,7 +7,7 @@
 static const char pack_usage[] = "git-pack-objects [--window=N] [--depth=N] base-name < object-list";
 
 enum object_type {
-	OBJ_NONE,
+	OBJ_NONE = 0,
 	OBJ_COMMIT,
 	OBJ_TREE,
 	OBJ_BLOB,
@@ -55,7 +55,7 @@ static unsigned long write_object(struct
 	char type[10];
 	void *buf = read_sha1_file(entry->sha1, type, &size);
 	char header[25];
-	unsigned hdrlen, datalen;
+	unsigned hdrlen, datalen, bit;
 
 	if (!buf)
 		die("unable to read %s", sha1_to_hex(entry->sha1));
@@ -63,21 +63,33 @@ static unsigned long write_object(struct
 		die("object %s size inconsistency (%lu vs %lu)",
 		    sha1_to_hex(entry->sha1), size, entry->size);
 
 	/*
-	 * The object header is a byte of 'type' followed by four bytes of
-	 * length, except for deltas that has the 20 bytes of delta sha
-	 * instead.
+	 * The object header is one byte which low 4 bits represent the
+	 * object type, the 4 upper bits indicates which of the following
+	 * bytes are used to build the object size.  For delta objects the
+	 * sha1 of the reference object is also appended.
 	 */
-	header[0] = ".CTB"[entry->type];
-	hdrlen = 5;
+	header[0] = entry->type;
 	if (entry->delta) {
-		header[0] = 'D';
-		memcpy(header+5, entry->delta, 20);
+		header[0] = OBJ_DELTA;
 		buf = delta_against(buf, size, entry);
 		size = entry->delta_size;
-		hdrlen = 25;
 	}
-	datalen = htonl(size);
-	memcpy(header+1, &datalen, 4);
+	hdrlen = 1;
+	bit = (1 << 4);
+	datalen = size;
+	do {
+		if (datalen & 0xff) {
+			header[0] |= bit;
+			header[hdrlen++] = datalen;
+		}
+		bit <<= 1;
+		datalen >>= 8;
+	} while (datalen);
+	if (entry->delta) {
+		memcpy(header+hdrlen, entry->delta, 20);
+		hdrlen += 20;
+	}
+
 	sha1write(f, header, hdrlen);
 	datalen = sha1write_compressed(f, buf, size);
 	free(buf);
diff --git a/unpack-objects.c b/unpack-objects.c
--- a/unpack-objects.c
+++ b/unpack-objects.c
@@ -12,6 +12,16 @@ struct pack_entry {
 	unsigned char sha1[20];
 };
 
+enum object_type {
+	OBJ_NONE = 0,
+	OBJ_COMMIT,
+	OBJ_TREE,
+	OBJ_BLOB,
+	OBJ_DELTA
+};
+
+static char *type_s[] = { NULL, "commit", "tree", "blob", "delta" };
+
 static void *pack_base;
 static unsigned long pack_size;
 static void *index_base;
@@ -92,7 +102,7 @@ static int check_index(void)
 }
 
 static int unpack_non_delta_entry(struct pack_entry *entry,
-				  int kind,
+				  char *type,
 				  unsigned char *data,
 				  unsigned long size,
 				  unsigned long left)
@@ -101,9 +111,8 @@ static int unpack_non_delta_entry(struct
 	z_stream stream;
 	char *buffer;
 	unsigned char sha1[20];
-	char *type_s;
 
-	printf("%s %c %lu\n", sha1_to_hex(entry->sha1), kind, size);
+	printf("%s %s %lu\n", sha1_to_hex(entry->sha1), type, size);
 	if (dry_run)
 		return 0;
 
@@ -120,18 +129,12 @@ static int unpack_non_delta_entry(struct
 	inflateEnd(&stream);
 	if ((st != Z_STREAM_END) || stream.total_out != size)
 		goto err_finish;
-	switch (kind) {
-	case 'C': type_s = "commit"; break;
-	case 'T': type_s = "tree"; break;
-	case 'B': type_s = "blob"; break;
-	default: goto err_finish;
-	}
-	if (write_sha1_file(buffer, size, type_s, sha1) < 0)
+	if (write_sha1_file(buffer, size, type, sha1) < 0)
 		die("failed to write %s (%s)",
-		    sha1_to_hex(entry->sha1), type_s);
-	printf("%s %s\n", sha1_to_hex(sha1), type_s);
+		    sha1_to_hex(entry->sha1), type);
+	printf("%s %s\n", sha1_to_hex(sha1), type);
 	if (memcmp(sha1, entry->sha1, 20))
-		die("resulting %s have wrong SHA1", type_s);
+		die("resulting %s have wrong SHA1", type);
 
 	finish: st = 0;
@@ -183,15 +186,13 @@ static int unpack_delta_entry(struct pac
 		die("truncated pack file");
 	data = base_sha1 + 20;
 	data_size = left - 20;
-	printf("%s D %lu", sha1_to_hex(entry->sha1), delta_size);
+	printf("%s delta %lu", sha1_to_hex(entry->sha1), delta_size);
 	printf(" %s\n", sha1_to_hex(base_sha1));
 
 	if (dry_run)
 		return 0;
 
-	/* pack+5 is the base sha1, unless we have it, we need to
-	 * unpack it first.
-	 */
+	/* unless we have the base sha1, we need to unpack it first. */
 	if (!has_sha1_file(base_sha1)) {
 		struct pack_entry *base;
 		if (!find_pack_entry(base_sha1, &base))
@@ -236,7 +237,9 @@ static int unpack_delta_entry(struct pac
 static void unpack_entry(struct pack_entry *entry)
 {
 	unsigned long offset, size, left;
-	unsigned char *pack;
+	unsigned char *pack, sizebits;
+	enum object_type type;
+	int i;
 
 	/* Have we done this one already due to deltas based on it?
 	 */
 	if (lookup_object(entry->sha1))
@@ -246,14 +249,25 @@ static void unpack_entry(struct pack_ent
 	if (offset > pack_size - 5)
 		die("object offset outside of pack file");
 	pack = pack_base + offset;
-	size = (pack[1] << 24) + (pack[2] << 16) + (pack[3] << 8) + pack[4];
-	left = pack_size - offset - 5;
-	switch (*pack) {
-	case 'C': case 'T': case 'B':
-		unpack_non_delta_entry(entry, *pack, pack+5, size, left);
+	sizebits = *pack++;
+	type = sizebits & 0x0f;
+	sizebits >>= 4;
+	i = size = 0;
+	while (sizebits) {
+		if (sizebits & 1)
+			size |= *pack++ << i;
+		i += 8;
+		sizebits >>= 1;
+	}
+	left = pack_size - ((void *)pack - pack_base);
+	switch (type) {
+	case OBJ_COMMIT:
+	case OBJ_TREE:
+	case OBJ_BLOB:
+		unpack_non_delta_entry(entry, type_s[type], pack, size, left);
 		break;
-	case 'D':
-		unpack_delta_entry(entry, pack+5, size, left);
+	case OBJ_DELTA:
+		unpack_delta_entry(entry, pack, size, left);
 		break;
 	default:
 		die("corrupted pack file");
* Re: kernel.org and GIT tree rebuilding
From: Linus Torvalds @ 2005-06-28 19:28 UTC
To: Nicolas Pitre; +Cc: Git Mailing List, Chris Mason

On Tue, 28 Jun 2005, Nicolas Pitre wrote:
>
> Here's one improvement to the pack format, breaking it early so it
> won't affect anyone at this point: a compressed object header.
> Instead of the fixed 5-byte header, this patch converts it to a
> variable size: since most objects are small, we save on the storage
> of the significant size bytes which will be zero, packing the
> non-zero byte positions in with the object type.

Ok, this is against an older version and doesn't have that
"read_sha1" thing, but yes, something like this would work.

I'd prefer the encoding to be a bit different, though: make the size
be encoded in seven bits per byte, with the high bit meaning "more to
come".  We can use four bits from the "type" byte for the initial
value, making lengths 0-15 be free.

	unsigned long size;
	unsigned char c;

	c = *pack++;
	size = c & 15;
	type = (c >> 4) & 7;
	while (c & 0x80) {
		c = *pack++;
		size = (size << 7) + (c & 0x7f);
	}

or something.  That's even denser.

However, I also end up wanting to add a "global header" to the
pack-file, that contains at least the number of objects packed.  We
may not know how big the pack-file will be, but we'll at least know
how many objects it has before we start writing it.

		Linus
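[ An encoder to match that decode loop might look like the sketch
  below.  It is illustrative only; note that since the decoder above
  accumulates the high-order bits first, the encoder has to emit the
  most significant 7-bit group immediately after the initial 4 bits:

	/* Produce the variable-length header described above: 3 bits
	 * of type and 4 bits of size in the first byte, then 7 bits
	 * of size per continuation byte, most significant group
	 * first.  The high bit of each byte means "more size bytes to
	 * come".  Returns the header length in bytes. */
	static int encode_header(unsigned int type, unsigned long size,
				 unsigned char *hdr)
	{
		int shift = 0, n = 0;

		/* how many 7-bit groups follow the initial 4 bits? */
		while ((size >> 4) >> shift)
			shift += 7;

		hdr[n++] = ((type & 7) << 4) | ((size >> shift) & 15) |
			   (shift ? 0x80 : 0);
		while (shift) {
			shift -= 7;
			hdr[n++] = ((size >> shift) & 0x7f) |
				   (shift ? 0x80 : 0);
		}
		return n;
	}

  With this scheme an object under 16 bytes needs a single header
  byte, and anything under 2048 bytes fits in two. ]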
* Re: kernel.org and GIT tree rebuilding
From: Nicolas Pitre @ 2005-06-28 21:08 UTC
To: Linus Torvalds; +Cc: Git Mailing List, Chris Mason

On Tue, 28 Jun 2005, Linus Torvalds wrote:

> On Tue, 28 Jun 2005, Nicolas Pitre wrote:
> >
> > Here's one improvement to the pack format, breaking it early so
> > it won't affect anyone at this point: a compressed object header.
> > Instead of the fixed 5-byte header, this patch converts it to a
> > variable size: since most objects are small, we save on the
> > storage of the significant size bytes which will be zero, packing
> > the non-zero byte positions in with the object type.
>
> Ok, this is against an older version and doesn't have that
> "read_sha1" thing, but yes, something like this would work.
>
> I'd prefer the encoding to be a bit different, though: make the
> size be encoded in seven bits per byte, with the high bit meaning
> "more to come".  We can use four bits from the "type" byte for the
> initial value, making lengths 0-15 be free.
>
>	unsigned long size;
>	unsigned char c;
>
>	c = *pack++;
>	size = c & 15;
>	type = (c >> 4) & 7;
>	while (c & 0x80) {
>		c = *pack++;
>		size = (size << 7) + (c & 0x7f);
>	}
>
> or something.  That's even denser.

OK.  New patch below.

> However, I also end up wanting to add a "global header" to the
> pack-file, that contains at least the number of objects packed.  We
> may not know how big the pack-file will be, but we'll at least know
> how many objects it has before we start writing it.

Probably a version signature would be a good thing too.

=====

This patch adds a compressed object header.  Instead of the fixed
5-byte header, it converts the header to a variable size: since most
objects are small, we save on the storage of the size's most
significant bytes, which are zero, even folding bits in with the
object type.  Objects up to 2047 bytes in size will have a header of
only 2 bytes.

Signed-off-by: Nicolas Pitre <nico@cam.org>

diff --git a/pack-objects.c b/pack-objects.c
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -3,24 +3,17 @@
 #include "object.h"
 #include "delta.h"
 #include "csum-file.h"
+#include "pack.h"
 
 static const char pack_usage[] = "git-pack-objects [--window=N] [--depth=N] {--stdout | base-name} < object-list";
 
-/*
- * The object type is a single-character shorthand:
- *  - 'C' for "Commit"
- *  - 'T' for "Tree"
- *  - 'B' for "Blob"
- *  - 'G' for "taG"
- *  - 'D' for "Delta"
- */
 struct object_entry {
 	unsigned char sha1[20];
 	unsigned long size;
 	unsigned long offset;
 	unsigned int depth;
 	unsigned int hash;
-	unsigned char type;
+	packed_obj_type type;
 	unsigned long delta_size;
 	struct object_entry *delta;
 };
@@ -63,21 +56,30 @@ static unsigned long write_object(struct
 		die("object %s size inconsistency (%lu vs %lu)",
 		    sha1_to_hex(entry->sha1), size, entry->size);
 
 	/*
-	 * The object header is a byte of 'type' followed by four bytes of
-	 * length, except for deltas that has the 20 bytes of delta sha
-	 * instead.
+	 * The object header first byte has its low 4 bits representing the
+	 * object type, the 4 upper bits indicating which of the following
+	 * bytes are used to build the object size.  For delta objects the
+	 * sha1 of the reference object is also appended.
 	 */
 	header[0] = entry->type;
-	hdrlen = 5;
 	if (entry->delta) {
-		header[0] = 'D';
-		memcpy(header+5, entry->delta, 20);
+		header[0] = PACKED_DELTA;
 		buf = delta_against(buf, size, entry);
 		size = entry->delta_size;
-		hdrlen = 25;
 	}
-	datalen = htonl(size);
-	memcpy(header+1, &datalen, 4);
+	header[0] |= size << 3;
+	hdrlen = 1;
+	datalen = size >> 4;
+	while (datalen) {
+		header[hdrlen - 1] |= 0x80;
+		header[hdrlen++] = datalen;
+		datalen >>= 7;
+	}
+	if (entry->delta) {
+		memcpy(header+hdrlen, entry->delta, 20);
+		hdrlen += 20;
+	}
+
 	sha1write(f, header, hdrlen);
 	datalen = sha1write_compressed(f, buf, size);
 	free(buf);
@@ -168,13 +170,13 @@ static void check_object(struct object_e
 
 	if (!sha1_object_info(entry->sha1, type, &entry->size)) {
 		if (!strcmp(type, "commit")) {
-			entry->type = 'C';
+			entry->type = PACKED_COMMIT;
 		} else if (!strcmp(type, "tree")) {
-			entry->type = 'T';
+			entry->type = PACKED_TREE;
 		} else if (!strcmp(type, "blob")) {
-			entry->type = 'B';
+			entry->type = PACKED_BLOB;
 		} else if (!strcmp(type, "tag")) {
-			entry->type = 'G';
+			entry->type = PACKED_TAG;
 		} else
 			die("unable to pack object %s of type %s",
 			    sha1_to_hex(entry->sha1), type);
diff --git a/pack.h b/pack.h
new file mode 100644
--- /dev/null
+++ b/pack.h
@@ -0,0 +1,19 @@
+#ifndef PACK_H
+#define PACK_H
+
+/*
+ * The packed object type is stored in the low 3 bits of a byte.
+ * The type value 0 is a reserved prefix if ever there is more than 7
+ * object types, or any future format extensions.
+ */
+
+typedef enum {
+	PACKED_RESERVED = 0,
+	PACKED_COMMIT = 1,
+	PACKED_TREE = 2,
+	PACKED_BLOB = 3,
+	PACKED_TAG = 4,
+	PACKED_DELTA = 7
+} packed_obj_type;
+
+#endif
diff --git a/unpack-objects.c b/unpack-objects.c
--- a/unpack-objects.c
+++ b/unpack-objects.c
@@ -1,6 +1,7 @@
 #include "cache.h"
 #include "object.h"
 #include "delta.h"
+#include "pack.h"
 
 static int dry_run;
 static int nr_entries;
@@ -12,6 +13,14 @@ struct pack_entry {
 	unsigned char sha1[20];
 };
 
+static char *type_s[] = {
+	[PACKED_COMMIT] = "commit",
+	[PACKED_TREE]   = "tree",
+	[PACKED_BLOB]   = "blob",
+	[PACKED_TAG]    = "tag",
+	[PACKED_DELTA]  = "delta"
+};
+
 static void *pack_base;
 static unsigned long pack_size;
 static void *index_base;
@@ -92,7 +101,7 @@ static int check_index(void)
 }
 
 static int unpack_non_delta_entry(struct pack_entry *entry,
-				  int kind,
+				  char *type,
 				  unsigned char *data,
 				  unsigned long size,
 				  unsigned long left)
@@ -101,9 +110,8 @@ static int unpack_non_delta_entry(struct
 	z_stream stream;
 	char *buffer;
 	unsigned char sha1[20];
-	char *type_s;
 
-	printf("%s %c %lu\n", sha1_to_hex(entry->sha1), kind, size);
+	printf("%s %s %lu\n", sha1_to_hex(entry->sha1), type, size);
 	if (dry_run)
 		return 0;
 
@@ -120,22 +128,15 @@ static int unpack_non_delta_entry(struct
 	inflateEnd(&stream);
 	if ((st != Z_STREAM_END) || stream.total_out != size)
 		goto err_finish;
-	switch (kind) {
-	case 'C': type_s = "commit"; break;
-	case 'T': type_s = "tree"; break;
-	case 'B': type_s = "blob"; break;
-	case 'G': type_s = "tag"; break;
-	default: goto err_finish;
-	}
-	if (write_sha1_file(buffer, size, type_s, sha1) < 0)
+	if (write_sha1_file(buffer, size, type, sha1) < 0)
 		die("failed to write %s (%s)",
-		    sha1_to_hex(entry->sha1), type_s);
-	printf("%s %s\n", sha1_to_hex(sha1), type_s);
+		    sha1_to_hex(entry->sha1), type);
+	printf("%s %s\n", sha1_to_hex(sha1), type);
 	if (memcmp(sha1, entry->sha1, 20))
-		die("resulting %s have wrong SHA1", type_s);
+		die("resulting %s have wrong SHA1", type);
 
-	finish: st = 0;
+ finish:
 	free(buffer);
 	return st;
  err_finish:
@@ -184,15 +185,13 @@ static int unpack_delta_entry(struct pac
 		die("truncated pack file");
 	data = base_sha1 + 20;
 	data_size = left - 20;
-	printf("%s D %lu", sha1_to_hex(entry->sha1), delta_size);
+	printf("%s delta %lu", sha1_to_hex(entry->sha1), delta_size);
 	printf(" %s\n", sha1_to_hex(base_sha1));
 
 	if (dry_run)
 		return 0;
 
-	/* pack+5 is the base sha1, unless we have it, we need to
-	 * unpack it first.
-	 */
+	/* unless we have the base sha1, we need to unpack it first. */
 	if (!has_sha1_file(base_sha1)) {
 		struct pack_entry *base;
 		if (!find_pack_entry(base_sha1, &base))
@@ -237,7 +236,9 @@ static int unpack_delta_entry(struct pac
 static void unpack_entry(struct pack_entry *entry)
 {
 	unsigned long offset, size, left;
-	unsigned char *pack;
+	unsigned char *pack, sizebits;
+	packed_obj_type type;
+	int i;
 
 	/* Have we done this one already due to deltas based on it? */
 	if (lookup_object(entry->sha1))
@@ -247,17 +248,28 @@ static void unpack_entry(struct pack_ent
 	if (offset > pack_size - 5)
 		die("object offset outside of pack file");
 	pack = pack_base + offset;
-	size = (pack[1] << 24) + (pack[2] << 16) + (pack[3] << 8) + pack[4];
-	left = pack_size - offset - 5;
-	switch (*pack) {
-	case 'C': case 'T': case 'B': case 'G':
-		unpack_non_delta_entry(entry, *pack, pack+5, size, left);
+	sizebits = *pack++;
+	type = sizebits & 0x07;
+	size = (sizebits & ~0x80) >> 3;
+	i = 4;
+	while (sizebits & 0x80) {
+		sizebits = *pack++;
+		size |= (sizebits & ~0x80) << i;
+		i += 7;
+	}
+	left = pack_size - ((void *)pack - pack_base);
+	switch (type) {
+	case PACKED_COMMIT:
+	case PACKED_TREE:
+	case PACKED_BLOB:
+	case PACKED_TAG:
+		unpack_non_delta_entry(entry, type_s[type], pack, size, left);
 		break;
-	case 'D':
-		unpack_delta_entry(entry, pack+5, size, left);
+	case PACKED_DELTA:
+		unpack_delta_entry(entry, pack, size, left);
 		break;
 	default:
-		die("corrupted pack file");
+		die("corrupted pack file (unknown object type %d)", type);
 	}
 }
* Re: kernel.org and GIT tree rebuilding
From: Linus Torvalds @ 2005-06-28 21:27 UTC
To: Nicolas Pitre; +Cc: Git Mailing List, Chris Mason

On Tue, 28 Jun 2005, Nicolas Pitre wrote:
>
> OK.  New patch below.

Dammit, I wasted all that time doing it myself.

I just committed and pushed out my version.  But mine also does
sha1_file.c right, so that you can use a packed archive in
.git/objects/pack.  Yours has some other cleanups, so..

Can you double-check my version?  (It hasn't mirrored out yet, it
seems, but it should be there soon.)

> Probably a version signature would be a good thing too.

I did that too, same format as for the index file.  Nothing checks it
though.

		Linus
* [PATCH] Bugfix: initialize pack_base to NULL. 2005-06-28 21:27 ` Linus Torvalds @ 2005-06-28 21:55 ` Junio C Hamano 2005-06-29 3:55 ` kernel.org and GIT tree rebuilding Nicolas Pitre 1 sibling, 0 replies; 39+ messages in thread From: Junio C Hamano @ 2005-06-28 21:55 UTC (permalink / raw) To: Linus Torvalds; +Cc: git This was causing random segfaults, because use_packed_git() got confused by random garbage there. Signed-off-by: Junio C Hamano <junkio@cox.net> --- sha1_file.c | 1 + 1 files changed, 1 insertions(+), 0 deletions(-) facb119577a28bbb3f2ac1e5f8db37fd2f6d31d8 diff --git a/sha1_file.c b/sha1_file.c --- a/sha1_file.c +++ b/sha1_file.c @@ -395,6 +395,7 @@ static struct packed_git *add_packed_git p->pack_size = st.st_size; p->index_base = idx_map; p->next = NULL; + p->pack_base = NULL; p->pack_last_used = 0; return p; } ------------ ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: kernel.org and GIT tree rebuilding 2005-06-28 21:27 ` Linus Torvalds 2005-06-28 21:55 ` [PATCH] Bugfix: initialize pack_base to NULL Junio C Hamano @ 2005-06-29 3:55 ` Nicolas Pitre 2005-06-29 5:16 ` Nicolas Pitre 1 sibling, 1 reply; 39+ messages in thread From: Nicolas Pitre @ 2005-06-29 3:55 UTC (permalink / raw) To: Linus Torvalds; +Cc: Git Mailing List On Tue, 28 Jun 2005, Linus Torvalds wrote: > > > On Tue, 28 Jun 2005, Nicolas Pitre wrote: > > > > OK. New patch below. > > Dammit, I wasted all that time doing it myself. > > I just committed and pushed out my version. But mine also does sha1_file.c > right, so that you can use a packed archive in .git/objects/pack. Yours > has some other cleanups, so.. > > Can you double-check my version (it hasn't mirrored out yet, it seems, but > it should be there soon). OK... See below the cleanups I merged from my version on top of yours: pack-objects.c | 70 ++++++++++++++----------------------------------------- pack.h | 17 ++++++++----- unpack-objects.c | 66 +++++++++++++++++++++++++-------------------------- 3 files changed, 63 insertions(+), 90 deletions(-) I also restored my original object header size ordering (little endian) for two reasons: - it is much simpler to generate and therefore allows for removing quite some code - it allows for stable bit position which makes it much easier to look at an hex dump of the binary data for manual debugging Also a few code optimizations and one error return fix. Signed-off-by: Nicolas Pitre <nico@cam.org> diff --git a/pack-objects.c b/pack-objects.c --- a/pack-objects.c +++ b/pack-objects.c @@ -34,7 +34,7 @@ static void *delta_against(void *buf, un if (!otherbuf) die("unable to read %s", sha1_to_hex(entry->delta->sha1)); delta_buf = diff_delta(otherbuf, othersize, - buf, size, &delta_size, ~0UL); + buf, size, &delta_size, 0UL); if (!delta_buf || delta_size != entry->delta_size) die("delta size changed"); free(buf); @@ -42,54 +42,13 @@ static void *delta_against(void *buf, un return delta_buf; } -/* - * The per-object header is a pretty dense thing, which is - * - first byte: low four bits are "size", then three bits of "type", - * and the high bit is "size continues". - * - each byte afterwards: low seven bits are size continuation, - * with the high bit being "size continues" - */ -static int encode_header(enum object_type type, unsigned long size, unsigned char *hdr) -{ - int n = 1, i; - unsigned char c; - - if (type < OBJ_COMMIT || type > OBJ_DELTA) - die("bad type %d", type); - - /* - * Shift the size up by 7 bits at a time, - * until you get bits in the "high four". - * That will be our beginning. We'll have - * four size bits in 28..31, then groups - * of seven in 21..27, 14..20, 7..13 and - * finally 0..6. 
- */ - if (size) { - n = 5; - while (!(size & 0xfe000000)) { - size <<= 7; - n--; - } - } - c = (type << 4) | (size >> 28); - for (i = 1; i < n; i++) { - *hdr++ = c | 0x80; - c = (size >> 21) & 0x7f; - size <<= 7; - } - *hdr = c; - return n; -} - static unsigned long write_object(struct sha1file *f, struct object_entry *entry) { unsigned long size; char type[10]; void *buf = read_sha1_file(entry->sha1, type, &size); - unsigned char header[10]; + char header[25]; unsigned hdrlen, datalen; - enum object_type obj_type; if (!buf) die("unable to read %s", sha1_to_hex(entry->sha1)); @@ -97,22 +56,31 @@ static unsigned long write_object(struct die("object %s size inconsistency (%lu vs %lu)", sha1_to_hex(entry->sha1), size, entry->size); /* - * The object header is a byte of 'type' followed by zero or - * more bytes of length. For deltas, the 20 bytes of delta sha1 - * follows that. + * The object header first byte has its low 3 bits representing the + * object type, the 4 upper bits indicating which of the following + * bytes are used to build the object size. For delta objects the + * sha1 of the reference object is also appended. */ - obj_type = entry->type; if (entry->delta) { + header[0] = OBJ_DELTA; buf = delta_against(buf, size, entry); size = entry->delta_size; - obj_type = OBJ_DELTA; + } else + header[0] = entry->type; + header[0] |= size << 3; + hdrlen = 1; + datalen = size >> 4; + while (datalen) { + header[hdrlen - 1] |= 0x80; + header[hdrlen++] = datalen; + datalen >>= 7; } - hdrlen = encode_header(obj_type, size, header); - sha1write(f, header, hdrlen); if (entry->delta) { - sha1write(f, entry->delta, 20); + memcpy(header+hdrlen, entry->delta, 20); hdrlen += 20; } + + sha1write(f, header, hdrlen); datalen = sha1write_compressed(f, buf, size); free(buf); return hdrlen + datalen; diff --git a/pack.h b/pack.h --- a/pack.h +++ b/pack.h @@ -1,13 +1,18 @@ #ifndef PACK_H #define PACK_H +/* + * The packed object type is stored in the low 3 bits of a byte. + * The type value 0 is a reserved prefix if ever there is more than 7 + * object types, or any future format extensions. 
+ */ enum object_type { - OBJ_NONE, - OBJ_COMMIT, - OBJ_TREE, - OBJ_BLOB, - OBJ_TAG, - OBJ_DELTA, + OBJ_EXT = 0, + OBJ_COMMIT = 1, + OBJ_TREE = 2, + OBJ_BLOB = 3, + OBJ_TAG = 4, + OBJ_DELTA = 7 }; /* diff --git a/unpack-objects.c b/unpack-objects.c --- a/unpack-objects.c +++ b/unpack-objects.c @@ -13,6 +13,14 @@ struct pack_entry { unsigned char sha1[20]; }; +static char *type_string[] = { + [OBJ_COMMIT] = "commit", + [OBJ_TREE] = "tree", + [OBJ_BLOB] = "blob", + [OBJ_TAG] = "tag", + [OBJ_DELTA] = "delta" +}; + static void *pack_base; static unsigned long pack_size; static void *index_base; @@ -93,7 +101,7 @@ static int check_index(void) } static int unpack_non_delta_entry(struct pack_entry *entry, - enum object_type kind, + char *type, unsigned char *data, unsigned long size, unsigned long left) @@ -102,9 +110,8 @@ static int unpack_non_delta_entry(struct z_stream stream; char *buffer; unsigned char sha1[20]; - char *type; - printf("%s %c %lu\n", sha1_to_hex(entry->sha1), ".CTBGD"[kind], size); + printf("%s %s %lu\n", sha1_to_hex(entry->sha1), type, size); if (dry_run) return 0; @@ -121,13 +128,6 @@ static int unpack_non_delta_entry(struct inflateEnd(&stream); if ((st != Z_STREAM_END) || stream.total_out != size) goto err_finish; - switch (kind) { - case OBJ_COMMIT: type = "commit"; break; - case OBJ_TREE: type = "tree"; break; - case OBJ_BLOB: type = "blob"; break; - case OBJ_TAG: type = "tag"; break; - default: goto err_finish; - } if (write_sha1_file(buffer, size, type, sha1) < 0) die("failed to write %s (%s)", sha1_to_hex(entry->sha1), type); @@ -135,8 +135,8 @@ static int unpack_non_delta_entry(struct if (memcmp(sha1, entry->sha1, 20)) die("resulting %s have wrong SHA1", type); - finish: st = 0; + finish: free(buffer); return st; err_finish: @@ -185,15 +185,13 @@ static int unpack_delta_entry(struct pac die("truncated pack file"); data = base_sha1 + 20; data_size = left - 20; - printf("%s D %lu", sha1_to_hex(entry->sha1), delta_size); + printf("%s delta %lu", sha1_to_hex(entry->sha1), delta_size); printf(" %s\n", sha1_to_hex(base_sha1)); if (dry_run) return 0; - /* pack+5 is the base sha1, unless we have it, we need to - * unpack it first. - */ + /* unless we have the base sha1, we need to unpack it first. */ if (!has_sha1_file(base_sha1)) { struct pack_entry *base; if (!find_pack_entry(base_sha1, &base)) @@ -238,8 +236,9 @@ static int unpack_delta_entry(struct pac static void unpack_entry(struct pack_entry *entry) { unsigned long offset, size, left; - unsigned char *pack, c; - int type; + unsigned char c, *pack = pack_base; + int i; + enum object_type type; /* Have we done this one already due to deltas based on it? 
*/ if (lookup_object(entry->sha1)) @@ -247,20 +246,17 @@ static void unpack_entry(struct pack_ent offset = ntohl(entry->offset); if (offset >= pack_size) - goto bad; - - pack = pack_base + offset; - c = *pack++; - offset++; - type = (c >> 4) & 7; - size = (c & 15); + goto out_of_bound; + c = pack[offset++]; + type = c & 0x07; + size = (c & ~0x80) >> 3; + i = 4; while (c & 0x80) { if (offset >= pack_size) - goto bad; - offset++; - c = *pack++; - size = (size << 7) + (c & 0x7f); - + goto out_of_bound; + c = pack[offset++]; + size |= (c & ~0x80) << i; + i += 7; } left = pack_size - offset; switch (type) { @@ -268,14 +264,18 @@ static void unpack_entry(struct pack_ent case OBJ_TREE: case OBJ_BLOB: case OBJ_TAG: - unpack_non_delta_entry(entry, type, pack, size, left); + unpack_non_delta_entry(entry, type_string[type], + pack+offset, size, left); return; case OBJ_DELTA: - unpack_delta_entry(entry, pack, size, left); + unpack_delta_entry(entry, pack+offset, size, left); return; + default: + die("corrupted pack file(unknown object type %d)", type); } -bad: - die("corrupted pack file"); + + out_of_bound: + die("corrupted pack file (object offset out of bound)"); } /* ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: kernel.org and GIT tree rebuilding 2005-06-29 3:55 ` kernel.org and GIT tree rebuilding Nicolas Pitre @ 2005-06-29 5:16 ` Nicolas Pitre 2005-06-29 5:43 ` Linus Torvalds 0 siblings, 1 reply; 39+ messages in thread From: Nicolas Pitre @ 2005-06-29 5:16 UTC (permalink / raw) To: Linus Torvalds; +Cc: Git Mailing List On Tue, 28 Jun 2005, Nicolas Pitre wrote: > OK... See below the cleanups I merged from my version on top of yours: Of course by the time I sent the above you already rewrote the ting to be streamable. So again :-) see below the cleanups I merged from my version on top of yours: pack-objects.c | 70 ++++++++++++++----------------------------------------- pack.h | 17 ++++++++----- unpack-objects.c | 29 ++++++++++++---------- 3 files changed, 46 insertions(+), 70 deletions(-) I also restored my original object header size ordering (little endian) for two reasons: - it is much simpler to generate and therefore allows for removing quite some code - it allows for stable bit position which makes it much easier to look at an hex dump of the binary data for manual debugging Signed-off-by: Nicolas Pitre <nico@cam.org> diff --git a/pack-objects.c b/pack-objects.c --- a/pack-objects.c +++ b/pack-objects.c @@ -34,7 +34,7 @@ static void *delta_against(void *buf, un if (!otherbuf) die("unable to read %s", sha1_to_hex(entry->delta->sha1)); delta_buf = diff_delta(otherbuf, othersize, - buf, size, &delta_size, ~0UL); + buf, size, &delta_size, 0UL); if (!delta_buf || delta_size != entry->delta_size) die("delta size changed"); free(buf); @@ -42,54 +42,13 @@ static void *delta_against(void *buf, un return delta_buf; } -/* - * The per-object header is a pretty dense thing, which is - * - first byte: low four bits are "size", then three bits of "type", - * and the high bit is "size continues". - * - each byte afterwards: low seven bits are size continuation, - * with the high bit being "size continues" - */ -static int encode_header(enum object_type type, unsigned long size, unsigned char *hdr) -{ - int n = 1, i; - unsigned char c; - - if (type < OBJ_COMMIT || type > OBJ_DELTA) - die("bad type %d", type); - - /* - * Shift the size up by 7 bits at a time, - * until you get bits in the "high four". - * That will be our beginning. We'll have - * four size bits in 28..31, then groups - * of seven in 21..27, 14..20, 7..13 and - * finally 0..6. - */ - if (size) { - n = 5; - while (!(size & 0xfe000000)) { - size <<= 7; - n--; - } - } - c = (type << 4) | (size >> 28); - for (i = 1; i < n; i++) { - *hdr++ = c | 0x80; - c = (size >> 21) & 0x7f; - size <<= 7; - } - *hdr = c; - return n; -} - static unsigned long write_object(struct sha1file *f, struct object_entry *entry) { unsigned long size; char type[10]; void *buf = read_sha1_file(entry->sha1, type, &size); - unsigned char header[10]; + char header[25]; unsigned hdrlen, datalen; - enum object_type obj_type; if (!buf) die("unable to read %s", sha1_to_hex(entry->sha1)); @@ -97,22 +56,31 @@ static unsigned long write_object(struct die("object %s size inconsistency (%lu vs %lu)", sha1_to_hex(entry->sha1), size, entry->size); /* - * The object header is a byte of 'type' followed by zero or - * more bytes of length. For deltas, the 20 bytes of delta sha1 - * follows that. + * The object header first byte has its low 3 bits representing the + * object type, the 4 upper bits indicating which of the following + * bytes are used to build the object size. For delta objects the + * sha1 of the reference object is also appended. 
*/ - obj_type = entry->type; if (entry->delta) { + header[0] = OBJ_DELTA; buf = delta_against(buf, size, entry); size = entry->delta_size; - obj_type = OBJ_DELTA; + } else + header[0] = entry->type; + header[0] |= size << 3; + hdrlen = 1; + datalen = size >> 4; + while (datalen) { + header[hdrlen - 1] |= 0x80; + header[hdrlen++] = datalen; + datalen >>= 7; } - hdrlen = encode_header(obj_type, size, header); - sha1write(f, header, hdrlen); if (entry->delta) { - sha1write(f, entry->delta, 20); + memcpy(header+hdrlen, entry->delta, 20); hdrlen += 20; } + + sha1write(f, header, hdrlen); datalen = sha1write_compressed(f, buf, size); free(buf); return hdrlen + datalen; diff --git a/pack.h b/pack.h --- a/pack.h +++ b/pack.h @@ -1,13 +1,18 @@ #ifndef PACK_H #define PACK_H +/* + * The packed object type is stored in the low 3 bits of a byte. + * The type value 0 is a reserved prefix if ever there is more than 7 + * object types, or any future format extensions. + */ enum object_type { - OBJ_NONE, - OBJ_COMMIT, - OBJ_TREE, - OBJ_BLOB, - OBJ_TAG, - OBJ_DELTA, + OBJ_EXT = 0, + OBJ_COMMIT = 1, + OBJ_TREE = 2, + OBJ_BLOB = 3, + OBJ_TAG = 4, + OBJ_DELTA = 7 }; /* diff --git a/unpack-objects.c b/unpack-objects.c --- a/unpack-objects.c +++ b/unpack-objects.c @@ -6,6 +6,14 @@ static int dry_run; static const char unpack_usage[] = "git-unpack-objects < pack-file"; +static char *type_string[] = { + [OBJ_COMMIT] = "commit", + [OBJ_TREE] = "tree", + [OBJ_BLOB] = "blob", + [OBJ_TAG] = "tag", + [OBJ_DELTA] = "delta" +}; + /* We always read in 4kB chunks. */ static unsigned char buffer[4096]; static unsigned long offset, len, eof; @@ -139,19 +147,11 @@ static void added_object(unsigned char * } } -static int unpack_non_delta_entry(enum object_type kind, unsigned long size) +static int unpack_non_delta_entry(char *type, unsigned long size) { void *buf = get_data(size); unsigned char sha1[20]; - char *type; - switch (kind) { - case OBJ_COMMIT: type = "commit"; break; - case OBJ_TREE: type = "tree"; break; - case OBJ_BLOB: type = "blob"; break; - case OBJ_TAG: type = "tag"; break; - default: die("bad type %d", kind); - } if (write_sha1_file(buf, size, type, sha1) < 0) die("failed to write object"); added_object(sha1, type, buf, size); @@ -184,26 +184,29 @@ static int unpack_delta_entry(unsigned l static void unpack_one(void) { unsigned char *pack, c; + unsigned int i; unsigned long size; enum object_type type; pack = fill(1); c = *pack; use(1); - type = (c >> 4) & 7; - size = (c & 15); + type = c & 0x07; + size = (c & ~0x80) >> 3; + i = 4; while (c & 0x80) { pack = fill(1); c = *pack++; use(1); - size = (size << 7) + (c & 0x7f); + size |= (c & ~0x80) << i; + i += 7; } switch (type) { case OBJ_COMMIT: case OBJ_TREE: case OBJ_BLOB: case OBJ_TAG: - unpack_non_delta_entry(type, size); + unpack_non_delta_entry(type_string[type], size); return; case OBJ_DELTA: unpack_delta_entry(size); ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: kernel.org and GIT tree rebuilding 2005-06-29 5:16 ` Nicolas Pitre @ 2005-06-29 5:43 ` Linus Torvalds 2005-06-29 5:54 ` Linus Torvalds 0 siblings, 1 reply; 39+ messages in thread From: Linus Torvalds @ 2005-06-29 5:43 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Git Mailing List On Wed, 29 Jun 2005, Nicolas Pitre wrote: > > Of course by the time I sent the above you already rewrote the thing to > be streamable. And by the time you sent me a new version, I'd already taken part of your old one by hand ;) Linus ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: kernel.org and GIT tree rebuilding 2005-06-29 5:43 ` Linus Torvalds @ 2005-06-29 5:54 ` Linus Torvalds 2005-06-29 7:16 ` Last mile for 1.0 again Junio C Hamano 0 siblings, 1 reply; 39+ messages in thread From: Linus Torvalds @ 2005-06-29 5:54 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Git Mailing List On Tue, 28 Jun 2005, Linus Torvalds wrote: > > On Wed, 29 Jun 2005, Nicolas Pitre wrote: > > > > Of course by the time I sent the above you already rewrote the thing to > > be streamable. > > And by the time you sent me a new version, I'd already taken part of your > old one by hand ;) Btw, I have the size/type bits reversed from your setup, but please don't change that, since that would be yet another incompatible pack format change, and I'd like to calm things down. Also, I notice that you decode the sizes really strangely: you have a "while() { }" loop and two separate loads. It's much nicer to do it with a "do { } while()" loop and a single load, since not only is it less code, a do-while loop compiles to better code than a while() loop (unless the compiler is crazy, which it sometimes is). Linus ^ permalink raw reply [flat|nested] 39+ messages in thread
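A minimal sketch of the decoder shape Linus is suggesting, written against Nicolas's little-endian layout (type in the low 3 bits of the first byte) because that layout lets every header byte go through a single load site. This is not the committed decoder, whose first byte keeps the type in bits 4..6, and the function name is made up:

    /* Decode type and size with a do { } while () and one load per
     * iteration: accumulate 7-bit groups little-endian, then peel
     * the 3 type bits off the bottom of the result. */
    static unsigned long decode_le_header(const unsigned char *p,
                                          unsigned *type)
    {
        unsigned long val = 0;
        unsigned shift = 0;
        unsigned char c;

        do {
            c = *p++;
            val |= (unsigned long)(c & 0x7f) << shift;
            shift += 7;
        } while (c & 0x80);
        *type = val & 0x07;
        return val >> 3;
    }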
* Last mile for 1.0 again 2005-06-29 5:54 ` Linus Torvalds @ 2005-06-29 7:16 ` Junio C Hamano 2005-06-29 9:51 ` [PATCH] Add git-verify-pack command Junio C Hamano 2005-07-04 21:40 ` Last mile for 1.0 again Daniel Barkalow 0 siblings, 2 replies; 39+ messages in thread From: Junio C Hamano @ 2005-06-29 7:16 UTC (permalink / raw) To: git; +Cc: Linus Torvalds >>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes: LT> ..., but please don't LT> change that, since that would be yet another incompatible pack format LT> change, and I'd like to calm things down. Calming things down is good. Some things we should be looking at during the calming-down period are: - Ask for expert help to verify that our use of zlib is OK; Linus asked for this in a message tonight. I am not qualified to do this myself. - There is currently nothing that verifies the SHA1 sums embedded in idx and pack files. A tool for it is needed [*1*]. I may be doing this myself. - Although there is a hook in sha1_file() that allows mapping a pack file on demand (and keeping track of their uses for LRU eviction), there is currently no code to actually evict a mapped pack file. We need to add some sort of use-reference counting to lock down a pack file in use, if somebody is inclined to do this. I will not be doing this myself immediately. - Think about what to do with the "dumb server" pull protocols regarding packed archive files (think about http-pull). I will not be doing this myself. - Design a "smart server" pull/push pair. This can independently and actively be done without impacting the "calming down" period, and I am assuming Dan is already looking into this. This may result in another "deathmatch" between Dan and Jason Mcmullan, which would be fun to watch ;-). Not limited to "calming down the packed GIT", but bigger-picture pre-release preparation items I think we need to look into are: - Update the "Last mile for 1.0" list. I think the only thing that is left from the one I posted on Jun 5th is the tutorials. Anybody interested in adding more pages to the tutorials? If there is no taker, I may start continuing what is already there, stealing the example project outline from "Arch meets Hello World" [*2*]. - Double-check that the documentation, the usage strings in the code, and what the code does still match. Last time I checked, I think some files in the Documentation/ directory were not reachable from the main GIT(1) page and were not touched by the build process from Documentation/Makefile. I will not be doing this myself. - Blame/Annotate. Does anybody have a fast and correct one [*3*]? - Publicity. Somebody may want to start preparing materials to have us included in http://better-scm.berlios.de/comparison/ [Footnotes] *1* One possibility is to add that to fsck-cache when --full is used, but I am somewhat reluctant to do it that way. It would require you to place that "suspicious" pack under your repository's .git/objects/pack in order to verify it, which can spread the damage before you know it. In general, the idx file is relatively small, so we _could_ check the SHA1 sum for idx files when we map them at runtime in check_packed_git_idx(). However, verifying a 1MB (kernel archive case) idx file every time we run a single core GIT command may be a price we would not want to pay, so I am not too enthusiastic about doing it. We certainly would not want to do this for pack files every time a pack is mapped at runtime (60MB kernel archive case). *2* http://regexps.srparish.net/www/tutorial/html/arch.html.
I am not talking about stealing the concepts, but just about stealing the sample project that the example players work on in the sample sessions. I think its section on update/replay corresponds closely to merge/rebase, for example. *3* I have what I wrote in Perl, which does rename/copy/rewrite and multiple parents correctly, but it is way too slow for on-demand use. ^ permalink raw reply [flat|nested] 39+ messages in thread
* [PATCH] Add git-verify-pack command. 2005-06-29 7:16 ` Last mile for 1.0 again Junio C Hamano @ 2005-06-29 9:51 ` Junio C Hamano 2005-06-29 16:15 ` Linus Torvalds 2005-07-04 21:40 ` Last mile for 1.0 again Daniel Barkalow 1 sibling, 1 reply; 39+ messages in thread From: Junio C Hamano @ 2005-06-29 9:51 UTC (permalink / raw) To: Linus Torvalds; +Cc: git Given a list of <pack>.idx files, this command validates the index file and the corresponding .pack file for consistency. This patch also uses the same validation mechanism in fsck-cache when the --full flag is used. During normal operation, sha1_file.c verifies that a given .idx file matches the .pack file by comparing the SHA1 checksum stored in .idx file and .pack file as a minimum sanity check. We may further want to check the pack signature and version when we map the pack, but that would be a separate patch. Earlier, errors to map a pack file was not flagged fatal but led to a random fatal error later. This version explicitly die()s when such an error is detected. Signed-off-by: Junio C Hamano <junkio@cox.net> --- *** This actually covers two of the items I listed in the "Last *** mile again" message. I ended up doing the LRU myself. *** Linus, what would be the practical/recommended/sane limit *** for mmap regions we would want to use? Currently I start *** throwing away mapped packs when the total size of mapped *** packs exceeds 64MB. *** Currently, pack-objects/unpack-objects pair is not *** documented. Any takers? *** Oh, again, Linus, what is your preferred way to get a *** cover-letter material (like this) for a single patch? I *** always do a separate cover letter for multi-patch series *** with [PATCH 0/N], but do you prefer to get [PATCH 0/1] and *** [PATCH 1/1] for a single patch, or do you prefer to see *** commentaries after the three-dash line like this message *** does? Documentation/git-verify-pack.txt | 38 +++++++++++++++++++++ Documentation/git.txt | 3 ++ Makefile | 5 ++- cache.h | 4 ++ fsck-cache.c | 5 +++ pack.h | 2 + sha1_file.c | 66 ++++++++++++++++++++++++++++--------- t/t5300-pack-object.sh | 38 +++++++++++++++++++++ verify-pack.c | 26 +++++++++++++++ verify_pack.c | 62 +++++++++++++++++++++++++++++++++++ 10 files changed, 231 insertions(+), 18 deletions(-) create mode 100644 Documentation/git-verify-pack.txt create mode 100644 verify-pack.c create mode 100644 verify_pack.c 8ac4f374a587c58b3d2ecc44220b6b866f769350 diff --git a/Documentation/git-verify-pack.txt b/Documentation/git-verify-pack.txt new file mode 100644 --- /dev/null +++ b/Documentation/git-verify-pack.txt @@ -0,0 +1,38 @@ +git-verify-pack(1) +================== +v0.1, June 2005 + +NAME +---- +git-verify-pack - Validate packed GIT archive files. + + +SYNOPSIS +-------- +'git-verify-pack' <pack>.idx ... + + +DESCRIPTION +----------- +Reads given idx file for packed GIT archive created with +git-pack-objects command and verifies idx file and the +corresponding pack file. + +OPTIONS +------- +<pack>.idx ...:: + The idx files to verify. 
+ + +Author +------ +Written by Junio C Hamano <junkio@cox.net> + +Documentation +-------------- +Documentation by Junio C Hamano + +GIT +--- +Part of the link:git.html[git] suite + diff --git a/Documentation/git.txt b/Documentation/git.txt --- a/Documentation/git.txt +++ b/Documentation/git.txt @@ -110,6 +110,9 @@ link:git-tar-tree.html[git-tar-tree]:: link:git-unpack-file.html[git-unpack-file]:: Creates a temporary file with a blob's contents +link:git-verify-pack.html[git-verify-pack]:: + Validates packed GIT archive files + The interrogate commands may create files - and you can force them to touch the working file set - but in general they don't diff --git a/Makefile b/Makefile --- a/Makefile +++ b/Makefile @@ -36,7 +36,7 @@ PROG= git-update-cache git-diff-files git-diff-helper git-tar-tree git-local-pull git-write-blob \ git-get-tar-commit-id git-apply git-stripspace \ git-cvs2git git-diff-stages git-rev-parse git-patch-id \ - git-pack-objects git-unpack-objects + git-pack-objects git-unpack-objects git-verify-pack all: $(PROG) @@ -45,7 +45,7 @@ install: $(PROG) $(SCRIPTS) LIB_OBJS=read-cache.o sha1_file.o usage.o object.o commit.o tree.o blob.o \ tag.o date.o index.o diff-delta.o patch-delta.o entry.o \ - epoch.o refs.o csum-file.o + epoch.o refs.o csum-file.o verify_pack.o LIB_FILE=libgit.a LIB_H=cache.h object.h blob.h tree.h commit.h tag.h delta.h epoch.h csum-file.h pack.h @@ -124,6 +124,7 @@ git-rev-parse: rev-parse.c git-patch-id: patch-id.c git-pack-objects: pack-objects.c git-unpack-objects: unpack-objects.c +git-verify-pack: verify-pack.c git-http-pull: LIBS += -lcurl git-rev-list: LIBS += -lssl diff --git a/cache.h b/cache.h --- a/cache.h +++ b/cache.h @@ -246,9 +246,13 @@ extern struct packed_git { unsigned int *index_base; void *pack_base; unsigned int pack_last_used; + unsigned int pack_use_cnt; char pack_name[0]; /* something like ".git/objects/pack/xxxxx.pack" */ } *packed_git; extern void prepare_packed_git(void); +extern int use_packed_git(struct packed_git *); +extern void unuse_packed_git(struct packed_git *); +extern struct packed_git *add_packed_git(char *, int); extern int num_packed_objects(const struct packed_git *p); extern int nth_packed_object_sha1(const struct packed_git *, int, unsigned char*); diff --git a/fsck-cache.c b/fsck-cache.c --- a/fsck-cache.c +++ b/fsck-cache.c @@ -6,6 +6,7 @@ #include "tree.h" #include "blob.h" #include "tag.h" +#include "pack.h" #define REACHABLE 0x0001 @@ -437,6 +438,10 @@ int main(int argc, char **argv) alt_odb[j].name[-1] = '/'; } prepare_packed_git(); + for (p = packed_git; p; p = p->next) + /* verify gives error messages itself */ + verify_pack(p); + for (p = packed_git; p; p = p->next) { int num = num_packed_objects(p); for (i = 0; i < num; i++) { diff --git a/pack.h b/pack.h --- a/pack.h +++ b/pack.h @@ -27,4 +27,6 @@ struct pack_header { unsigned int hdr_entries; }; +extern int verify_pack(struct packed_git *); + #endif diff --git a/sha1_file.c b/sha1_file.c --- a/sha1_file.c +++ b/sha1_file.c @@ -302,7 +302,7 @@ static int check_packed_git_idx(const ch index = idx_map; /* check index map */ - if (idx_size < 4*256 + 20) + if (idx_size < 4*256 + 20 + 20) return error("index file too small"); nr = 0; for (i = 0; i < 256; i++) { @@ -327,12 +327,29 @@ static int check_packed_git_idx(const ch return 0; } -static void unuse_one_packed_git(void) +static int unuse_one_packed_git(void) { - /* NOTYET */ + struct packed_git *p, *lru = NULL; + + for (p = packed_git; p; p = p->next) { + if (p->pack_use_cnt || !p->pack_base) + 
continue; + if (!lru || p->pack_last_used < lru->pack_last_used) + lru = p; + } + if (!lru) + return 0; + munmap(lru->pack_base, lru->pack_size); + lru->pack_base = NULL; + return 1; +} + +void unuse_packed_git(struct packed_git *p) +{ + p->pack_use_cnt--; } -static int use_packed_git(struct packed_git *p) +int use_packed_git(struct packed_git *p) { if (!p->pack_base) { int fd; @@ -340,28 +357,36 @@ static int use_packed_git(struct packed_ void *map; pack_mapped += p->pack_size; - while (PACK_MAX_SZ < pack_mapped) - unuse_one_packed_git(); + while (PACK_MAX_SZ < pack_mapped && unuse_one_packed_git()) + ; /* nothing */ fd = open(p->pack_name, O_RDONLY); if (fd < 0) - return -1; + die("packfile %s cannot be opened", p->pack_name); if (fstat(fd, &st)) { close(fd); - return -1; + die("packfile %s cannot be opened", p->pack_name); } if (st.st_size != p->pack_size) - return -1; + die("packfile %s size mismatch.", p->pack_name); map = mmap(NULL, p->pack_size, PROT_READ, MAP_PRIVATE, fd, 0); close(fd); if (map == MAP_FAILED) - return -1; + die("packfile %s cannot be mapped.", p->pack_name); p->pack_base = map; + + /* Check if the pack file matches with the index file. + * this is cheap. + */ + if (memcmp((char*)(p->index_base) + p->index_size - 40, + p->pack_base + p->pack_size - 20, 20)) + die("packfile %s does not match index.", p->pack_name); } p->pack_last_used = pack_used_ctr++; + p->pack_use_cnt++; return 0; } -static struct packed_git *add_packed_git(char *path, int path_len) +struct packed_git *add_packed_git(char *path, int path_len) { struct stat st; struct packed_git *p; @@ -388,6 +413,7 @@ static struct packed_git *add_packed_git p->next = NULL; p->pack_base = NULL; p->pack_last_used = 0; + p->pack_use_cnt = 0; return p; } @@ -671,6 +697,7 @@ static int packed_object_info(struct pac unsigned long offset, size, left; unsigned char *pack; enum object_type kind; + int retval; if (use_packed_git(p)) die("cannot map packed file"); @@ -681,8 +708,9 @@ static int packed_object_info(struct pac switch (kind) { case OBJ_DELTA: - return packed_delta_info(pack, size, left, type, sizep); - break; + retval = packed_delta_info(pack, size, left, type, sizep); + unuse_packed_git(p); + return retval; case OBJ_COMMIT: strcpy(type, "commit"); break; @@ -699,6 +727,7 @@ static int packed_object_info(struct pac die("corrupted pack file"); } *sizep = size; + unuse_packed_git(p); return 0; } @@ -785,6 +814,7 @@ static void *unpack_entry(struct pack_en unsigned long offset, size, left; unsigned char *pack; enum object_type kind; + void *retval; if (use_packed_git(p)) die("cannot map packed file"); @@ -794,7 +824,9 @@ static void *unpack_entry(struct pack_en left = p->pack_size - offset; switch (kind) { case OBJ_DELTA: - return unpack_delta_entry(pack, size, left, type, sizep); + retval = unpack_delta_entry(pack, size, left, type, sizep); + unuse_packed_git(p); + return retval; case OBJ_COMMIT: strcpy(type, "commit"); break; @@ -811,12 +843,14 @@ static void *unpack_entry(struct pack_en die("corrupted pack file"); } *sizep = size; - return unpack_non_delta_entry(pack, size, left); + retval = unpack_non_delta_entry(pack, size, left); + unuse_packed_git(p); + return retval; } int num_packed_objects(const struct packed_git *p) { - /* See check_packed_git_idx and pack-objects.c */ + /* See check_packed_git_idx() */ return (p->index_size - 20 - 20 - 4*256) / 24; } diff --git a/t/t5300-pack-object.sh b/t/t5300-pack-object.sh --- a/t/t5300-pack-object.sh +++ b/t/t5300-pack-object.sh @@ -127,4 +127,42 @@ 
test_expect_success \ } >current && diff expect current' +unset GIT_OBJECT_DIRECTORY + +test_expect_success \ + 'verify pack' \ + 'git-verify-pack test-1.idx test-2.idx' + +test_expect_success \ + 'corrupt a pack and see if verify catches' \ + 'cp test-1.idx test-3.idx && + cp test-2.pack test-3.pack && + if git-verify-pack test-3.idx + then false + else :; + fi && + + cp test-1.pack test-3.pack && + dd if=/dev/zero of=test-3.pack count=1 bs=1 conv=notrunc seek=2 && + if git-verify-pack test-3.idx + then false + else :; + fi && + + cp test-1.pack test-3.pack && + dd if=/dev/zero of=test-3.pack count=1 bs=1 conv=notrunc seek=7 && + if git-verify-pack test-3.idx + then false + else :; + fi && + + cp test-1.pack test-3.pack && + dd if=/dev/zero of=test-3.pack count=1 bs=1 conv=notrunc seek=12 && + if git-verify-pack test-3.idx + then false + else :; + fi && + + :' + test_done diff --git a/verify-pack.c b/verify-pack.c new file mode 100644 --- /dev/null +++ b/verify-pack.c @@ -0,0 +1,26 @@ +#include "cache.h" +#include "pack.h" + +static int verify_one_pack(char *arg) +{ + struct packed_git *g = add_packed_git(arg, strlen(arg)); + if (!g) + return -1; + return verify_pack(g); +} + +int main(int ac, char **av) +{ + int errs = 0; + + while (1 < ac) { + char path[PATH_MAX]; + strcpy(path, av[1]); + if (verify_one_pack(path)) + errs++; + else + printf("%s: OK\n", av[1]); + ac--; av++; + } + return !!errs; +} diff --git a/verify_pack.c b/verify_pack.c new file mode 100644 --- /dev/null +++ b/verify_pack.c @@ -0,0 +1,62 @@ +#include "cache.h" +#include "pack.h" + +static int verify_packfile(struct packed_git *p) +{ + unsigned long index_size = p->index_size; + void *index_base = p->index_base; + SHA_CTX ctx; + unsigned char sha1[20]; + unsigned long pack_size = p->pack_size; + void *pack_base; + struct pack_header *hdr; + int nr_objects; + + hdr = p->pack_base; + if (hdr->hdr_signature != htonl(PACK_SIGNATURE)) + return error("Packfile signature mismatch", p->pack_name); + if (hdr->hdr_version != htonl(PACK_VERSION)) + return error("Packfile version %d different from ours %d", + ntohl(hdr->hdr_version), PACK_VERSION); + nr_objects = ntohl(hdr->hdr_entries); + if (num_packed_objects(p) != nr_objects) + return error("Packfile claims to have %d objects, " + "while idx size expects %d", nr_objects, + num_packed_objects(p)); + + SHA1_Init(&ctx); + pack_base = p->pack_base; + SHA1_Update(&ctx, pack_base, pack_size - 20); + SHA1_Final(sha1, &ctx); + if (memcmp(sha1, index_base + index_size - 40, 20)) + return error("Packfile %s SHA1 mismatch with idx", + p->pack_name); + if (memcmp(sha1, pack_base + pack_size - 20, 20)) + return error("Packfile %s SHA1 mismatch with itself", + p->pack_name); + return 0; +} + + +int verify_pack(struct packed_git *p) +{ + unsigned long index_size = p->index_size; + void *index_base = p->index_base; + SHA_CTX ctx; + unsigned char sha1[20]; + int ret; + + /* Verify SHA1 sum of the index file */ + SHA1_Init(&ctx); + SHA1_Update(&ctx, index_base, index_size - 20); + SHA1_Final(sha1, &ctx); + if (memcmp(sha1, index_base + index_size - 20, 20)) + return error("Packfile index for %s SHA1 mismatch", + p->pack_name); + + /* Verify pack file */ + use_packed_git(p); + ret = verify_packfile(p); + unuse_packed_git(p); + return ret; +} ------------ ^ permalink raw reply [flat|nested] 39+ messages in thread
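Both checks in the patch above rest on the same trailer convention: each of the .idx and .pack files ends with a SHA1 of everything that precedes it. A minimal sketch of that self-checksum test over an already-mmap'ed file follows; the helper name is made up, and the patch itself does this inline in verify_pack() and verify_packfile():

    #include <string.h>
    #include <openssl/sha.h>

    /* Return 0 if the final 20 bytes of the mapped file equal the
     * SHA1 of everything before them, -1 otherwise. */
    static int check_sha1_trailer(const unsigned char *base,
                                  unsigned long len)
    {
        SHA_CTX ctx;
        unsigned char sha1[20];

        if (len < 20)
            return -1;
        SHA1_Init(&ctx);
        SHA1_Update(&ctx, base, len - 20);
        SHA1_Final(sha1, &ctx);
        return memcmp(sha1, base + len - 20, 20) ? -1 : 0;
    }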
* Re: [PATCH] Add git-verify-pack command. 2005-06-29 9:51 ` [PATCH] Add git-verify-pack command Junio C Hamano @ 2005-06-29 16:15 ` Linus Torvalds 0 siblings, 0 replies; 39+ messages in thread From: Linus Torvalds @ 2005-06-29 16:15 UTC (permalink / raw) To: Junio C Hamano; +Cc: git On Wed, 29 Jun 2005, Junio C Hamano wrote: > > *** Linus, what would be the practical/recommended/sane limit > *** for mmap regions we would want to use? Currently I start > *** throwing away mapped packs when the total size of mapped > *** packs exceeds 64MB. Hey, anybody who ever used BK had better have had 1GB of memory for any real development. So 64MB is peanuts, but it sounds like a good guess. > *** Oh, again, Linus, what is your preferred way to get a > *** cover-letter material (like this) for a single patch? I don't want cover-letters for single patches, or necessarily even for short series (1-3 entries). The cover letter is more interesting for large series, or even with small series if the early patches don't make much sense on their own: then the cover-letter ends up being a useful place to explain what the _sequence_ does, and why patch #2 that seems to be totally useless and removes a feature is actually good ("we'll re-implement it better in #5 after we've cleaned the code up"). So generally, commentary after the three dashes is good if the commentary is "local", i.e. not related to a series. Only with non-local explanations does a separate [0/N] thing make sense to me. Linus ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Last mile for 1.0 again 2005-06-29 7:16 ` Last mile for 1.0 again Junio C Hamano 2005-06-29 9:51 ` [PATCH] Add git-verify-pack command Junio C Hamano @ 2005-07-04 21:40 ` Daniel Barkalow 2005-07-04 21:45 ` Junio C Hamano 2005-07-04 21:59 ` Linus Torvalds 1 sibling, 2 replies; 39+ messages in thread From: Daniel Barkalow @ 2005-07-04 21:40 UTC (permalink / raw) To: Junio C Hamano; +Cc: git, Linus Torvalds On Wed, 29 Jun 2005, Junio C Hamano wrote: > - Blame/Annotate. Does anybody have a fast and correct one How about an option to git-rev-list to take a path, and (1) exclude any branch where the version at that path ends up ignored in a merge and (2) not list any revision where the version at that path is identical to a parent? This should give you the list of all commits which are directly responsible for the present state of the file, which can then be formatted as desired. -Daniel *This .sig left intentionally blank* ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Last mile for 1.0 again 2005-07-04 21:40 ` Last mile for 1.0 again Daniel Barkalow @ 2005-07-04 21:45 ` Junio C Hamano 2005-07-04 21:59 ` Linus Torvalds 1 sibling, 0 replies; 39+ messages in thread From: Junio C Hamano @ 2005-07-04 21:45 UTC (permalink / raw) To: Daniel Barkalow; +Cc: git, Linus Torvalds >>>>> "DB" == Daniel Barkalow <barkalow@iabervon.org> writes: DB> On Wed, 29 Jun 2005, Junio C Hamano wrote: >> - Blame/Annotate. Does anybody have a fast and correct one DB> How about an option to git-rev-list to take a path, and (1) exclude any DB> branch where the version at that path ends up ignored in a merge and DB> (2) not list any revision where the version at that path is identical to a DB> parent? DB> This should give you the list of all commits which are directly DB> responsible for the present state of the file, which can then be formatted DB> as desired. Sounds close enough if you do not care about copies and complete rewrites. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Last mile for 1.0 again 2005-07-04 21:40 ` Last mile for 1.0 again Daniel Barkalow 2005-07-04 21:45 ` Junio C Hamano @ 2005-07-04 21:59 ` Linus Torvalds 2005-07-04 22:41 ` Daniel Barkalow 1 sibling, 1 reply; 39+ messages in thread From: Linus Torvalds @ 2005-07-04 21:59 UTC (permalink / raw) To: Daniel Barkalow; +Cc: Junio C Hamano, git On Mon, 4 Jul 2005, Daniel Barkalow wrote: > > How about an option to git-rev-list to take a path, and (1) exclude any > branch where the version at that path ends up ignored in a merge and > (2) not list any revision where the version at that path is identical to a > parent? Hmm. How is that different from "git-whatchanged path", really? Linus ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Last mile for 1.0 again 2005-07-04 21:59 ` Linus Torvalds @ 2005-07-04 22:41 ` Daniel Barkalow 2005-07-04 23:06 ` Junio C Hamano 0 siblings, 1 reply; 39+ messages in thread From: Daniel Barkalow @ 2005-07-04 22:41 UTC (permalink / raw) To: Linus Torvalds, Junio C Hamano; +Cc: git On Mon, 4 Jul 2005, Linus Torvalds wrote: > On Mon, 4 Jul 2005, Daniel Barkalow wrote: > > > > How about an option to git-rev-list to take a path, and (1) exclude any > > branch where the version at that path ends up ignored in a merge and > > (2) not list any revision where the version at that path is identical to a > > parent? > > Hmm. How is that different from "git-whatchanged path", really? It would short-circuit going up areas of the history which don't contribute (i.e., lead up to a merge which took its version from a different parent). It could also stop when it ran out of branches that have the file at all. Neither of these is all that significant, I guess. Junio: what's missing from annotate/blame? -Daniel *This .sig left intentionally blank* ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Last mile for 1.0 again 2005-07-04 22:41 ` Daniel Barkalow @ 2005-07-04 23:06 ` Junio C Hamano 2005-07-05 1:54 ` Daniel Barkalow 0 siblings, 1 reply; 39+ messages in thread From: Junio C Hamano @ 2005-07-04 23:06 UTC (permalink / raw) To: Daniel Barkalow; +Cc: Linus Torvalds, git >>>>> "DB" == Daniel Barkalow <barkalow@iabervon.org> writes: DB> Junio: what's missing from annotate/blame? Which one are you talking about? What I use to generate http://members.cox.net/junkio/Summary.txt is an implementation of an algorithm I consider "complete" in that it does rename/copy and complete rewrite correctly. What is missing from the implementation is efficiency. ------------ #!/usr/bin/perl -w use strict; package main; $::debug = 0; sub read_blob { my $sha1 = shift; my $fh = undef; my $result; local ($/) = undef; open $fh, '-|', 'git-cat-file', 'blob', $sha1 or die "cannot read blob $sha1"; $result = join('', <$fh>); close $fh or die "failure while closing pipe to git-cat-file"; return $result; } sub read_diff_raw { my ($parent, $filename) = @_; my $fh = undef; local ($/) = "\0"; my @result = (); my ($meta, $status, $sha1_1, $sha1_2, $file1, $file2); print STDERR "* diff-cache --cached $parent $filename\n" if $::debug; my $has_changes = 0; open $fh, '-|', 'git-diff-cache', '--cached', '-z', $parent, $filename or die "cannot read git-diff-cache $parent $filename"; while (defined ($meta = <$fh>)) { $has_changes = 1; } close $fh or die "failure while closing pipe to git-diff-cache"; if (!$has_changes) { return (); } $fh = undef; print STDERR "* diff-cache -B -C --find-copies-harder --cached $parent\n" if $::debug; open($fh, '-|', 'git-diff-cache', '-B', '-C', '--find-copies-harder', '--cached', '-z', $parent) or die "cannot read git-diff-cache with $parent"; while (defined ($meta = <$fh>)) { chomp($meta); (undef, undef, $sha1_1, $sha1_2, $status) = split(/ /, $meta); $file1 = <$fh>; chomp($file1); if ($status =~ /^[CR]/) { $file2 = <$fh>; chomp($file2); } elsif ($status =~ /^D/) { next; } else { $file2 = $file1; } if ($file2 eq $filename) { push @result, [$status, $sha1_1, $sha1_2, $file1, $file2]; } } close $fh or die "failure while closing pipe to git-diff-cache"; return @result; } sub write_temp_blob { my ($sha1, $temp) = @_; my $fh = undef; my $blob = read_blob($sha1); open $fh, '>', $temp or die "cannot open temporary file $temp"; print $fh $blob; close($fh); } package Git::Patch; sub new { my ($class, $sha1_1, $sha1_2) = @_; my $self = bless [], $class; my $fh = undef; ::write_temp_blob($sha1_1, "/tmp/blame-$$-1"); ::write_temp_blob($sha1_2, "/tmp/blame-$$-2"); open $fh, '-|', 'diff', '-u0', "/tmp/blame-$$-1", "/tmp/blame-$$-2" or die "cannot read diff"; while (<$fh>) { if (/^\@\@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? \@\@/) { push @$self, [$1, (defined $2 ? $2 : 1), $3, (defined $4 ? $4 : 1)]; } } close $fh; unlink "/tmp/blame-$$-1", "/tmp/blame-$$-2"; return $self; } sub find_parent_line { my ($self, $commit_lineno) = @_; my $ofs = 0; for (@$self) { my ($line_1, $len_1, $line_2, $len_2) = @$_; if ($commit_lineno < $line_2) { return $commit_lineno - $ofs; } if ($line_2 <= $commit_lineno && $commit_lineno < $line_2 + $len_2) { return -1; # changed by commit. 
} $ofs += ($len_1 - $len_2); } return $commit_lineno + $ofs; } package Git::Commit; my %author_name_canon = ('Linus Torvalds <torvalds@ppc970.osdl.org.(none)>' => 'Linus Torvalds <torvalds@ppc970.osdl.org>'); sub canon_author_name { my ($name) = @_; if (exists $author_name_canon{$name}) { return $author_name_canon{$name}; } return $name; } sub new { my $class = shift; my $self = bless { PARENT => [], TREE => undef, AUTHOR => undef, COMMITTER => undef, }, $class; my $commit_sha1 = shift; $self->{SHA1} = $commit_sha1; my $fh = undef; open $fh, '-|', 'git-cat-file', 'commit', $commit_sha1 or die "cannot read commit object $commit_sha1"; while (<$fh>) { chomp; if (/^tree ([0-9a-f]{40})$/) { $self->{TREE} = $1; } elsif (/^parent ([0-9a-f]{40})$/) { push @{$self->{PARENT}}, $1; } elsif (/^author ([^>]+>)/) { $self->{AUTHOR} = canon_author_name($1); } elsif (/^committer ([^>]+>)/) { $self->{COMMITTER} = canon_author_name($1); } } close $fh or die "failure while closing pipe to git-cat-file"; return $self; } sub find_file { my ($commit, $path) = @_; my $result = undef; my $fh = undef; local ($/) = "\0"; open $fh, '-|', 'git-ls-tree', '-z', '-r', '-d', $commit->{TREE}, $path or die "cannot read git-ls-tree $commit->{TREE}"; while (<$fh>) { chomp; if (/^[0-7]{6} blob ([0-9a-f]{40}) (.*)$/) { if ($2 ne $path) { die "$2 ne $path???"; } $result = $1; last; } } close $fh or die "failure while closing pipe to git-ls-tree"; return $result; } package Git::Blame; sub new { my $class = shift; my $self = bless { LINE => [], UNKNOWN => undef, WORK => [], }, $class; my $commit = shift; my $filename = shift; my $sha1 = $commit->find_file($filename); my $blob = ::read_blob($sha1); my @blob = (split(/\n/, $blob)); for (my $i = 0; $i < @blob; $i++) { $self->{LINE}[$i] = +{ COMMIT => $commit, FOUND => undef, FILENAME => $filename, LINENO => ($i + 1), }; } $self->{UNKNOWN} = scalar @blob; push @{$self->{WORK}}, [$commit, $filename]; return $self; } sub read_blame_cache { my $self = shift; my $filename = shift; my $fh = undef; my $pi = $self->{'PATHINFO'} = {}; open $fh, '<', $filename; while (<$fh>) { chomp; my ($commit, $parent, $path) = split(/\t/, $_); $pi->{$path}{$commit}{$parent} = 1; } close $fh; } sub print { my $self = shift; my $line_termination = shift; for (my $i = 0; $i < @{$self->{LINE}}; $i++) { my $l = $self->{LINE}[$i]; print ($l->{FOUND} ? ':' : '?');; print "$l->{COMMIT}->{SHA1} "; print "$l->{COMMIT}->{AUTHOR} "; print "$l->{COMMIT}->{COMMITTER} "; print "$l->{LINENO} $l->{FILENAME}"; print $line_termination; } } sub take_responsibility { my ($self, $commit) = @_; for (my $i = 0; $i < @{$self->{LINE}}; $i++) { my $l = $self->{LINE}[$i]; if (! $l->{FOUND} && ($l->{COMMIT}->{SHA1} eq $commit->{SHA1})) { $l->{FOUND} = 1; $self->{UNKNOWN}--; } } } sub blame_parent { my ($self, $commit, $parent, $filename) = @_; my @diff = ::read_diff_raw($parent->{SHA1}, $filename); my $filename_in_parent; my $passed_blame_to_parent = undef; if (@diff == 0) { # We have not touched anything. Blame parent for everything # that we are suspected for. for (my $i = 0; $i < @{$self->{LINE}}; $i++) { my $l = $self->{LINE}[$i]; if (! $l->{FOUND} && ($l->{COMMIT}->{SHA1} eq $commit->{SHA1})) { $l->{COMMIT} = $parent; $passed_blame_to_parent = 1; } } $filename_in_parent = $filename; } elsif (@diff != 1) { # This should not happen. 
for (@diff) { print "** @$_\n"; } die "Oops"; } else { my ($status, $sha1_1, $sha1_2, $file1, $file2) = @{$diff[0]}; print STDERR "** $status $file1 $file2\n" if $::debug; if ($status =~ /N/ || $status =~ /M[0-9][0-9]/) { # Either some of other parents created it, or we did. # At this point the only thing we know is that this # parent is not responsible for it. ; } else { my $patch = Git::Patch->new($sha1_1, $sha1_2); $filename_in_parent = $file1; for (my $i = 0; $i < @{$self->{LINE}}; $i++) { my $l = $self->{LINE}[$i]; if (! $l->{FOUND} && $l->{COMMIT}->{SHA1} eq $commit->{SHA1}) { # We are suspected to have introduced this line. # Does it exist in the parent? my $lineno = $l->{LINENO}; my $parent_line = $patch->find_parent_line($lineno); if ($parent_line < 0) { # No, we may be the guilty ones, or some other # parent might be. We do not assign blame to # ourselves here yet. ; } else { # This line is coming from the parent, so pass # blame to it. $l->{COMMIT} = $parent; $l->{FILENAME} = $file1; $l->{LINENO} = $parent_line; $passed_blame_to_parent = 1; } } } } } if ($passed_blame_to_parent && $self->{UNKNOWN}) { unshift @{$self->{WORK}}, [$parent, $filename_in_parent]; } } sub assign { my ($self, $commit, $filename) = @_; # We do read-tree of the current commit and diff-cache # with each parents, instead of running diff-tree. This # is because diff-tree does not look for copies hard enough. if (exists $self->{'PATHINFO'} && exists $self->{'PATHINFO'}{$filename} && !exists $self->{'PATHINFO'}{$filename}{$commit->{SHA1}} && @{$commit->{PARENT}} == 1) { # This commit did not touch the path at all, and # has only one parent. It is all that parent's fault. my $parent = Git::Commit->new($commit->{PARENT}[0]); my $passed_blame_to_parent = 0; for (my $i = 0; $i < @{$self->{LINE}}; $i++) { my $l = $self->{LINE}[$i]; if (! $l->{FOUND} && ($l->{COMMIT}->{SHA1} eq $commit->{SHA1})) { $l->{COMMIT} = $parent; $passed_blame_to_parent = 1; } } if ($passed_blame_to_parent && $self->{UNKNOWN}) { unshift @{$self->{WORK}}, [$parent, $filename]; } return; } print STDERR "* read-tree $commit->{SHA1}\n" if $::debug; system('git-read-tree', '-m', $commit->{SHA1}); for my $parent (@{$commit->{PARENT}}) { $self->blame_parent($commit, Git::Commit->new($parent), $filename); } $self->take_responsibility($commit); } sub assign_blame { my ($self) = @_; while ($self->{UNKNOWN} && @{$self->{WORK}}) { my $wk = shift @{$self->{WORK}}; my ($commit, $filename) = @$wk; $self->assign($commit, $filename); } } ################################################################ package main; my $usage = "blame [-z] <commit> filename"; my $line_termination = "\n"; $::ENV{GIT_INDEX_FILE} = "/tmp/blame-$$-index"; unlink($::ENV{GIT_INDEX_FILE}); if ($ARGV[0] eq '-z') { $line_termination = "\0"; shift; } if (@ARGV != 2) { die $usage; } my $head_commit = Git::Commit->new($ARGV[0]); my $filename = $ARGV[1]; my $blame = Git::Blame->new($head_commit, $filename); if (-f ".blame-cache") { $blame->read_blame_cache(".blame-cache"); } $blame->assign_blame(); $blame->print($line_termination); unlink($::ENV{GIT_INDEX_FILE}); __END__ How does this work, and what do we do about merges? The algorithm considers that the first parent is our main line of development and treats it somewhat special than other parents. So we pass on the blame to the first parent if a line has not changed from it. 
For lines that have changed from the first parent, we must have either inherited that change from some other parent, or it could have been merge conflict resolution edit we did on our own. The following picture illustrates how we pass on and assign blames. In the sample, the original O was forked into A and B and then merged into M. Line 1, 2, and 4 did not change. Line 3 and 5 are changed in A, and Line 5 and 6 are changed in B. M made its own decision to resolve merge conflicts at Line 5 to something different from A and B: A: 1 2 T 4 T 6 / \ O: 1 2 3 4 5 6 M: 1 2 T 4 M S \ / B: 1 2 3 4 S S In the following picture, each line is annotated with a blame letter. A lowercase blame (e.g. "a" for "1") means that commit or its ancestor is the guilty party but we do not know which particular ancestor is responsible for the change yet. An uppercase blame means that we know that commit is the guilty party. First we look at M (the HEAD) and initialize Git::Blame->{LINE} like this: M: 1 2 T 4 M S m m m m m m That is, we know all lines are results of modification made by some ancestor of M, so we assign lowercase 'm' to all of them. Then we examine our first parent A. Throughout the algorithm, we are always only interested in the lines we are the suspect, but this being the initial round, we are the suspect for all of them. We notice that 1 2 T 4 are the same as the parent A, so we pass the blame for these four lines to A. M and S are different from A, so we leave them as they are (note that we do not immediately take the blame for them): M: 1 2 T 4 M S a a a a m m Next we go on to examine parent B. Again, we are only interested in the lines we are still the suspect (i.e. M and S). We notice S is something we inherited from B, so we pass the blame on to it, like this: M: 1 2 T 4 M S a a a a m b Once we exhausted the parents, we look at the results and take responsibility for the remaining ones that we are still the suspect: M: 1 2 T 4 M S a a a a M b We are done with M. And we know commits A and B need to be examined further, so we do them recursively. When we look at A, we again only look at the lines that A is the suspect: A: 1 2 T 4 T 6 a a a a M b Among 1 2 T 4, comparing against its parent O, we notice 1 2 4 are the same so pass the blame for those lines to O: A: 1 2 T 4 T 6 o o a o M b A is a non-merge commit; we have already exhausted the parents and take responsibility for the remaining ones that A is the suspect: A: 1 2 T 4 T 6 o o A o M b We go on like this and the final result would become: O: 1 2 3 4 5 6 O O A O M B ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Last mile for 1.0 again 2005-07-04 23:06 ` Junio C Hamano @ 2005-07-05 1:54 ` Daniel Barkalow 2005-07-05 6:24 ` Junio C Hamano 0 siblings, 1 reply; 39+ messages in thread From: Daniel Barkalow @ 2005-07-05 1:54 UTC (permalink / raw) To: Junio C Hamano; +Cc: Linus Torvalds, git On Mon, 4 Jul 2005, Junio C Hamano wrote: > >>>>> "DB" == Daniel Barkalow <barkalow@iabervon.org> writes: > > DB> Junio: what's missing from annotate/blame? > > Which one are you talking about? > > What I use to generate http://members.cox.net/junkio/Summary.txt > is an implementation of an algorithm I consider "complete" in > that it does rename/copy and complete rewrite correctly. What > is missing from the implementation is efficiency. [perl script] > How does this work, and what do we do about merges? I've got that part, but I'm not clear on how the rename/copy and complete rewrite stuff works. -Daniel *This .sig left intentionally blank* ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Last mile for 1.0 again 2005-07-05 1:54 ` Daniel Barkalow @ 2005-07-05 6:24 ` Junio C Hamano 2005-07-05 13:34 ` Marco Costalba 0 siblings, 1 reply; 39+ messages in thread From: Junio C Hamano @ 2005-07-05 6:24 UTC (permalink / raw) To: Daniel Barkalow; +Cc: Linus Torvalds, git >>>>> "DB" == Daniel Barkalow <barkalow@iabervon.org> writes: DB> [perl script] >> How does this work, and what do we do about merges? DB> I've got that part, but I'm not clear on how the rename/copy and complete DB> rewrite stuff works. Rename/copy is simply about what files to use when comparing between two trees. If you are tracking file F and find that a commit created that file by renaming what was originally G, then you compare old G and this F and assign blame for common lines to the ancestor of that commit (that had those lines in G), and take responsibility for the rest yourself (or assign blame for them to other parents). Rewrite is easier. If diff-tree -B says it is a complete rewrite, then instead of assigning blame for lines for which you are the current suspect to any of your parents, you take blame for all of them yourself, without comparing the file with the parent to assign blame line-by-line. ^ permalink raw reply [flat|nested] 39+ messages in thread
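The two rules can be condensed into a short dispatch sketch; every type and helper name below is hypothetical and only stands in for the corresponding step of the Perl script above:

    struct blame_state;
    struct commit { int nr_parents; struct commit **parent; };

    /* hypothetical helpers, declared only to make the shape clear */
    int is_complete_rewrite(struct commit *c, const char *path);
    const char *path_in_parent(struct commit *c, int i, const char *path);
    void pass_blame_by_diff(struct blame_state *s, struct commit *c,
                            struct commit *parent,
                            const char *parent_path, const char *path);
    void take_blame_for_remaining(struct blame_state *s, struct commit *c);

    static void assign_blame_step(struct blame_state *s, struct commit *c,
                                  const char *path)
    {
        int i;

        if (is_complete_rewrite(c, path)) {
            /* diff-tree -B said rewrite: skip the line-by-line
             * comparison and keep the blame for all suspect lines */
            take_blame_for_remaining(s, c);
            return;
        }
        for (i = 0; i < c->nr_parents; i++)
            /* rename/copy only changes which path we compare:
             * path_in_parent() returns "G" when this commit
             * created "F" by renaming or copying "G" */
            pass_blame_by_diff(s, c, c->parent[i],
                               path_in_parent(c, i, path), path);
        /* whatever is still suspect after all parents is ours */
        take_blame_for_remaining(s, c);
    }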
* Re: Last mile for 1.0 again 2005-07-05 6:24 ` Junio C Hamano @ 2005-07-05 13:34 ` Marco Costalba 0 siblings, 0 replies; 39+ messages in thread From: Marco Costalba @ 2005-07-05 13:34 UTC (permalink / raw) To: Junio C Hamano, Daniel Barkalow; +Cc: Linus Torvalds, git --- Junio C Hamano <junkio@cox.net> wrote: > >>>>> "DB" == Daniel Barkalow <barkalow@iabervon.org> writes: > > DB> [perl script] > > >> How does this work, and what do we do about merges? > Checking diffs against all the parents can be computationally expensive. I'm developing a different algorithm for qgit, where I know, before running annotate, all the revs that changed a given file. The revs are saved in the shaHist list according to git-rev-list --merge-order. The algorithm operates from the oldest to the newest, calculating the annotation for each following rev. This algorithm takes advantage of the fact that most revs belong to 2 categories: 1) revs of the same development sequence (e.g. connected without merges) 2) empty merges, i.e. merges where the file we are interested in is listed in only one merging branch If a rev in shaHist does not belong to the above categories, a reach analysis must be performed in order to find the direct ancestor among all the previous revs in the list. So starting from shaHist a ReachList rl is populated by getReachability(): void Annotate::getReachability(ReachList& rl, const QString& file, const QStringList& shaHist) { rl.clear(); QStringList::const_iterator it = shaHist.begin(); Rev* r = revLookup(*it); rl.append(ReachInfo(*it, r->id, INITIAL)); for (it++; it != shaHist.end(); it++) { Rev* r = revLookup(*it); // initial revision case if (r->parents.count() == 0) { rl.append(ReachInfo(*it, r->id, INITIAL)); continue; } // merge case if (r->parents.count() > 1) { // check empty merges diffing against previous in list QString prevSha = rl.last().sha; QString d = getFileDiff(*it, prevSha, file); if (d.isEmpty()) { rl.append(ReachInfo(*it, r->id, EMPTY_MERGE)); rl.last().roots.append(prevSha); } else { // real merge: we need reach test.
QStringList roots; roots.clear(); for (uint i = 0; i < r->parents.count(); i++) { // here tree walking is performed QString root = getRoot(r->parents[i], rl); if (!root.isEmpty()) roots.append(root); } rl.append(ReachInfo(*it, r->id, MERGE)); rl.last().roots = roots; } } continue; } // normal case: a revision with a single parent if (r->id == rl.last().id) { // same development sequence QString prevSha = rl.last().sha; rl.append(ReachInfo(*it, r->id, SAME_BR)); // same branch rl.last().roots.append(prevSha); } else { // different development sequence QString root = getRoot(*it, rl); // <-- here tree walking is performed if (!root.isEmpty()) { rl.append(ReachInfo(*it, r->id, DIFF_BR)); rl.last().roots.append(root); } else qDebug("ASSERT getReachability: rev %s not reachable", (*it).latin1()); } } } Then ReachList rl is then used to compute correct annotations: void Annotate::run(const QString& file, const QStringList& shaHist, AnnotateHistory& ah) { QString d, diffTarget; QStringList t; int pos; ReachList rl; ah.clear(); getReachability(rl, file, shaHist); // <-- here ReachList rl is calculated for (uint i = 0; i < rl.count(); i++) { switch(rl[i].type) { // <-- here ReachList rl is used to find annotations case SAME_BR: case DIFF_BR: diffTarget = rl[i].roots.first(); d = getFileDiff(rl[i].sha, diffTarget, file); pos = shaHist.findIndex(diffTarget); ah.append(processDiff(d, ah[pos], getAuthor(rl[i].sha, shaHist))); break; case INITIAL: pos = shaHist.findIndex(rl[i].sha); ah.append(getFirstAnnotation(file, shaHist, pos)); break; case EMPTY_MERGE: diffTarget = rl[i].roots.first(); pos = shaHist.findIndex(diffTarget); t = ah[pos]; // copy annotation from previous ah.append(t); break; case MERGE: // append annotation calculated on first root diffTarget = rl[i].roots.first(); d = getFileDiff(rl[i].sha, diffTarget, file); pos = shaHist.findIndex(diffTarget); ah.append(processDiff(d, ah[pos], "Merge")); // update annotation with all the others roots for (uint j = 1; j < rl[i].roots.count(); j++) { diffTarget = rl[i].roots[j]; d = getFileDiff(rl[i].sha, diffTarget, file); pos = shaHist.findIndex(diffTarget); // create an annotation calculated between root[i] and // the merge. QStringList scndAnn = processDiff(d, ah[pos], getAuthor(diffTarget, shaHist)); // finally we unify the two annotations, so we catch // also the case of a 3-way merge unify(ah.last(), scndAnn); } break; } } } Advantage of this algorithm is that reach analisys, i.e. tree walking, is done, statistically, only for a small subset of revs. Another side benefit is that correct annotation is calculated for each rev and not only for newest one. Of course there are issues: 1) A ordered list of revs (file history) must be known in advance. This is not a problem for qgit because file history is calculated while loading revs. 2) Does not takes in account file renames. Hope this notes can help in ongoing discussion Marco __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: kernel.org and GIT tree rebuilding
  2005-06-25  4:20 kernel.org and GIT tree rebuilding David S. Miller
  2005-06-25  4:40 ` Jeff Garzik
@ 2005-06-25  5:04 ` Junio C Hamano
  1 sibling, 0 replies; 39+ messages in thread
From: Junio C Hamano @ 2005-06-25  5:04 UTC (permalink / raw)
To: David S. Miller; +Cc: git

>>>>> "DSM" == David S Miller <davem@davemloft.net> writes:

DSM> To get a clean history to push to Linus, I typically blow
DSM> away my trees and make fresh ones to stick patches into
DSM> which I want to merge.

DSM> Should I:

DSM> 1) Do a git pull from Linus's tree once he takes my changes, then
DSM>    ask GIT to prune the tree? How do I do that and how does it work?

DSM> 2) Should I use .git/object/ database symlinking?

DSM> Are there any scripts out there which do this automatically?
DSM> Something as simple to run as "git-pull-script" and it takes
DSM> care of using links when possible on a local filesystem.

git-pull-script internally uses git-fetch-script, which knows how to
handle a local tree using hardlinks. Presumably the following
workflow would work:

(1) You hack away in your private tree, while you keep a "to be
    published" clean tree, both on your local machine.

(2) Do a GIT pull and merge in your private tree, to come up with a
    clean set of changes in your private tree. This is the tree you
    "typically blow away". Reordering the commits to come up with a
    clean history since you last pulled from Linus would also happen
    in this tree.

(3) Once you have a commit that you want to publish (i.e. the commit
    chain between that commit and the point you last pulled from
    Linus is the "clean history to push to Linus"), you go to your
    "to be published" clean tree, and run git-fetch-script to fetch
    the commit you want to publish from your private tree.

    When you give an absolute path as the "remote repo",
    git-local-pull with linking behaviour is used by
    git-fetch-script; otherwise the rsync backend is used and you end
    up with a polluted object database. This way you copy only the
    clean stuff from your private tree. Your HEAD in this tree should
    be set to the commit you wanted to publish. Running git-prune
    would be nicer, but if your history is truly clean it should not
    be necessary.

(4) Garbage collecting your private tree with git-prune is your
    business.

^ permalink raw reply	[flat|nested] 39+ messages in thread
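Concretely, steps (2) and (3) of the workflow above might look like the
following; the directory names and the kernel.org URL are illustrative
placeholders, not prescribed paths:

	# (2) in the private tree: pull from Linus, then rework and
	#     reorder patches until the history is clean
	cd ~/linux-private
	git-pull-script rsync://rsync.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git

	# (3) in the clean "to be published" tree: giving an absolute
	#     local path makes git-fetch-script use git-local-pull,
	#     which hardlinks objects instead of copying them
	cd ~/linux-publish
	git-fetch-script /home/me/linux-private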
* Re: kernel.org and GIT tree rebuilding
@ 2005-07-03 2:51 linux
0 siblings, 0 replies; 39+ messages in thread
From: linux @ 2005-07-03 2:51 UTC (permalink / raw)
To: git, torvalds
> 	unsigned long size;
> 	unsigned char c;
>
> 	c = *pack++;
> 	size = c & 15;
> 	type = (c >> 4) & 7;
> 	while (c & 0x80) {
> 		c = *pack++;
> 		size = (size << 7) + (c & 0x7f);
> 	}
>
> or something. That's even denser.
If you're going for density, you missed something. Try:
	c = *pack++;
	size = c & 15;
	type = (c >> 4) & 7;
	while (c & 0x80) {
		c = *pack++;
		size = (size << 7) + (c & 0x7f) + 16;
	}
Encoding is most easily done in little-endian order, such as:
#include <assert.h>	/* needed for assert() below */

static unsigned
encode(unsigned char *p, unsigned long x)
{
	unsigned char *q = p;
	unsigned char buf[5];
	unsigned char *b = buf;

	/* Peel off 7-bit groups, least significant first, subtracting
	 * the 16 bias that the decoder adds back per continuation byte. */
	while (x > 15) {
		assert(b < buf+5);
		x -= 16;
		*b++ = x & 0x7f;
		x >>= 7;
	}
	*b = x;	/* what remains fits in the first byte's low 4 bits;
		 * the type bits (4-6) are left zero for the caller */

	/* Emit in reverse, most significant group first, setting the
	 * continuation bit on every byte except the last. */
	while (b != buf)
		*q++ = *b-- | 0x80;
	*q++ = *b;
	return (unsigned)(q - p);
}
(You'll probably want to rewrite the above, but it's abandoned to the
public domain in any case. Go nuts.)
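The +16 bias is what buys the density: it removes redundant encodings.
Without it, the two-byte sequence 0x80 0x00 would decode to 0, the same
as the single byte 0x00; with it, the ranges are contiguous and disjoint
(one byte covers sizes 0-15, two bytes 16-2063, three bytes 2064 and
up). A round-trip harness for the two snippets above, written for this
document rather than taken from the mail (decode() here ignores the
type nibble; compile it together with encode() as given):

	#include <assert.h>
	#include <stdio.h>

	static unsigned encode(unsigned char *p, unsigned long x); /* above */

	/* Mirror of the size-extraction loop quoted earlier. */
	static unsigned long decode(const unsigned char *pack)
	{
		unsigned char c = *pack++;
		unsigned long size = c & 15;	/* low 4 bits of first byte */

		while (c & 0x80) {		/* continuation bit set */
			c = *pack++;
			size = (size << 7) + (c & 0x7f) + 16;
		}
		return size;
	}

	int main(void)
	{
		/* Boundary values where the 1-, 2- and 3-byte ranges end. */
		unsigned long tests[] = { 0, 15, 16, 2063, 2064, 4294967295UL };
		unsigned char buf[8];
		unsigned i;

		for (i = 0; i < sizeof(tests) / sizeof(tests[0]); i++) {
			unsigned len = encode(buf, tests[i]);
			assert(decode(buf) == tests[i]);
			printf("%10lu -> %u byte(s)\n", tests[i], len);
		}
		return 0;
	}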
^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2005-07-05 13:35 UTC | newest]
Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-06-25 4:20 kernel.org and GIT tree rebuilding David S. Miller
2005-06-25 4:40 ` Jeff Garzik
2005-06-25 5:23 ` Linus Torvalds
2005-06-25 5:48 ` Jeff Garzik
2005-06-25 6:16 ` Linus Torvalds
2005-06-26 16:41 ` Linus Torvalds
2005-06-26 18:39 ` Junio C Hamano
2005-06-26 19:19 ` Linus Torvalds
2005-06-26 19:45 ` Junio C Hamano
[not found] ` <7v1x6om6o5.fsf@assigned-by-dhcp.cox.net>
[not found] ` <Pine.LNX.4.58.0506271227160.19755@ppc970.osdl.org>
[not found] ` <7v64vzyqyw.fsf_-_@assigned-by-dhcp.cox.net>
2005-06-28 6:56 ` [PATCH] Obtain sha1_file_info() for deltified pack entry properly Junio C Hamano
2005-06-28 6:58 ` Junio C Hamano
2005-06-28 6:58 ` [PATCH 2/3] git-cat-file: use sha1_object_info() on '-t' Junio C Hamano
2005-06-28 6:59 ` [PATCH 3/3] git-cat-file: '-s' to find out object size Junio C Hamano
2005-06-26 20:52 ` kernel.org and GIT tree rebuilding Chris Mason
2005-06-26 21:03 ` Chris Mason
2005-06-26 21:40 ` Linus Torvalds
2005-06-26 22:34 ` Linus Torvalds
2005-06-28 18:06 ` Nicolas Pitre
2005-06-28 19:28 ` Linus Torvalds
2005-06-28 21:08 ` Nicolas Pitre
2005-06-28 21:27 ` Linus Torvalds
2005-06-28 21:55 ` [PATCH] Bugfix: initialize pack_base to NULL Junio C Hamano
2005-06-29 3:55 ` kernel.org and GIT tree rebuilding Nicolas Pitre
2005-06-29 5:16 ` Nicolas Pitre
2005-06-29 5:43 ` Linus Torvalds
2005-06-29 5:54 ` Linus Torvalds
2005-06-29 7:16 ` Last mile for 1.0 again Junio C Hamano
2005-06-29 9:51 ` [PATCH] Add git-verify-pack command Junio C Hamano
2005-06-29 16:15 ` Linus Torvalds
2005-07-04 21:40 ` Last mile for 1.0 again Daniel Barkalow
2005-07-04 21:45 ` Junio C Hamano
2005-07-04 21:59 ` Linus Torvalds
2005-07-04 22:41 ` Daniel Barkalow
2005-07-04 23:06 ` Junio C Hamano
2005-07-05 1:54 ` Daniel Barkalow
2005-07-05 6:24 ` Junio C Hamano
2005-07-05 13:34 ` Marco Costalba
2005-06-25 5:04 ` kernel.org and GIT tree rebuilding Junio C Hamano
-- strict thread matches above, loose matches on Subject: below --
2005-07-03 2:51 linux