* Pack v4 again..
From: Duy Nguyen @ 2015-02-13 10:46 UTC
To: Nicolas Pitre; +Cc: Git Mailing List

After taking a 1.5-year "vacation" from pack v4, I plan to do something
about it again. Will post more when I have some patches to discuss.
Only one question for now (forgive me if I asked it already, it's been
quite some time).

I think pack v4 does not deliver on its best promise: that walking a
tree is simply following pointers and jumping from place to place. When
we want to copy from the middle of another tree, we still need to scan
from the beginning of that tree. The tree offset cache helps, but the
problem remains. What do you think about an alternative format in which
each "copy" instruction includes both the index of the tree entry to
copy from (i.e. what we store now) _and_ the byte offset of that entry
from the beginning of the tree? With this byte offset, we know exactly
where to start copying without scanning from the beginning. It will be
a bit(?) bigger, but it's also faster.

I imagine this is an optimization that can be done locally. The pack
transferred over the network does not have these byte offsets. After
the pack is stored and verified by index-pack, we can rewrite it and
add this info. The simplest way is to use a fixed size for this offset
(e.g. uint16_t or even uint8_t) and add the placeholder to the copy
instructions of all v4 trees. After that, object offsets will not
change again, and we can fill the real offsets into the placeholders.

PS. The rebased version on recent master is here if anyone is
interested: https://github.com/pclouds/git/commits/pack-v4
--
Duy
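
A rough sketch of what such a copy instruction could carry, written as
a C struct for illustration only (the real pack v4 encoding uses
variable-length integers, and the field names here are invented):

    #include <stdint.h>

    struct tree_copy {
            uint32_t base_tree;  /* which base tree to copy from */
            uint32_t start;      /* index of the first entry to copy;
                                    this is what pack v4 stores today */
            uint32_t count;      /* number of entries to copy */
            uint16_t byte_ofs;   /* local-only addition: byte offset of
                                    entry 'start' in the expanded base
                                    tree; UINT16_MAX would act as the
                                    placeholder, meaning "unknown, fall
                                    back to scanning from the top" */
    };

On the wire, byte_ofs would simply hold the placeholder value;
index-pack would overwrite it once object offsets are final.
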
* Re: Pack v4 again..
From: Nicolas Pitre @ 2015-02-16 4:59 UTC
To: Duy Nguyen; +Cc: Git Mailing List

On Fri, 13 Feb 2015, Duy Nguyen wrote:

> After taking a 1.5-year "vacation" from pack v4, I plan to do
> something about it again. Will post more when I have some patches to
> discuss. Only one question for now (forgive me if I asked it already,
> it's been quite some time).

Yeah. I had to re-study my own code before replying.

> I think pack v4 does not deliver on its best promise: that walking a
> tree is simply following pointers and jumping from place to place.
> When we want to copy from the middle of another tree, we still need
> to scan from the beginning of that tree. The tree offset cache helps,
> but the problem remains. What do you think about an alternative
> format in which each "copy" instruction includes both the index of
> the tree entry to copy from (i.e. what we store now) _and_ the byte
> offset of that entry from the beginning of the tree? With this byte
> offset, we know exactly where to start copying without scanning from
> the beginning. It will be a bit(?) bigger, but it's also faster.

That would make the format inflexible. If we want to do partial
repacking by, say, copying some objects and repacking others (some
objects might need repacking because the objects they refer to are
omitted from the repack), and those repacked objects are themselves
referred to by byte offset, then we lose: the offset is no longer
valid.

> I imagine this is an optimization that can be done locally. The pack
> transferred over the network does not have these byte offsets. After
> the pack is stored and verified by index-pack, we can rewrite it and
> add this info. The simplest way is to use a fixed size for this
> offset (e.g. uint16_t or even uint8_t) and add the placeholder to the
> copy instructions of all v4 trees. After that, object offsets will
> not change again, and we can fill the real offsets into the
> placeholders.

Having a local extra index is fine, just like the pack index, which is
always created locally and can be recreated at any time. Some tree
offset cache might be beneficial, but I'd avoid baking it into the pack
file itself.

Yet I think the biggest problem with pack v4 at the moment is the
packing algorithm for tree objects. We are piggy-backing on the pack v2
object delta compression sorting, and that produces suboptimal results
due to deep recursions. And it is the recursion that kills us. Pack v4
requires a new packing algorithm for its tree objects.

What I imagined is something like this:

- Each canonical tree entry is made of a SHA1, mode and path. Let's
  assume this is hashed into a 24-bit value.

- Each tree object can therefore be represented as a string of 24-bit
  "characters".

- Delta-compressing a tree object becomes a substring search where we
  try to replace a sequence of "characters" with the longest "string"
  possible from another object. Repeat with the remaining sequences.

Having a 24-bit hash value is totally arbitrary. It could be 16 bits,
with more collisions but much faster searches and less memory usage.
The optimal value would need to be determined by some experimentation.

Algorithms for the longest common substring problem already exist, so
one of the classical algorithms could probably be adapted here.

This would allow for exploiting the provision in pack v4 to copy from
more than one tree object. And it would also favor shallower recursions
and even smaller packs. Imposing a minimum substring length (rather
than a maximum delta depth) would determine the runtime performance
when using the pack afterwards.

If you have enough free cycles to work on this, that's what I'd suggest
you explore at this point. I wish I could myself, as I think this ought
to be rather cool work.

Nicolas
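
To sketch the hashing step concretely (the mixing function below is
made up for illustration; collisions are acceptable because any
candidate match found via these 24-bit "characters" would still be
verified against the real entries before a copy is emitted):

    #include <stdint.h>
    #include <stddef.h>

    /* Hash one canonical tree entry (sha1, mode, path) into a 24-bit
       "character", so each tree becomes a string of such characters. */
    static uint32_t entry_hash24(const unsigned char *sha1,
                                 unsigned int mode, const char *path)
    {
            uint32_t h = 5381;
            size_t i;

            for (i = 0; i < 20; i++)
                    h = (h * 33) ^ sha1[i];
            h = (h * 33) ^ mode;
            while (*path)
                    h = (h * 33) ^ (unsigned char)*path++;
            return h & 0xffffff;    /* keep 24 bits */
    }
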
* Re: Pack v4 again..
From: Jeff King @ 2015-02-16 6:45 UTC
To: Nicolas Pitre; +Cc: Duy Nguyen, Git Mailing List

On Sun, Feb 15, 2015 at 11:59:02PM -0500, Nicolas Pitre wrote:

> Yet I think the biggest problem with pack v4 at the moment is the
> packing algorithm for tree objects. We are piggy-backing on the pack
> v2 object delta compression sorting, and that produces suboptimal
> results due to deep recursions. And it is the recursion that kills
> us. Pack v4 requires a new packing algorithm for its tree objects.
>
> What I imagined is something like this:
>
> - Each canonical tree entry is made of a SHA1, mode and path. Let's
>   assume this is hashed into a 24-bit value.
>
> - Each tree object can therefore be represented as a string of 24-bit
>   "characters".
>
> - Delta-compressing a tree object becomes a substring search where we
>   try to replace a sequence of "characters" with the longest "string"
>   possible from another object. Repeat with the remaining sequences.

Somewhat related to this, I was playing this weekend with the idea of
generating fast tree diffs from our on-disk deltas. That is, given a
base tree and a binary delta against it, could I reliably reproduce a
diff (one way or the other) in O(size of diff), rather than O(size of
tree)?

I have two applications in mind:

  1. Tree diffs for "git log -- pathspec" do the same diff (between a
     tree and its corresponding version in the parent) over and over.
     A significant percentage of the time (at least in linux.git and
     git.git), we store deltas between these trees. We could stop
     wasting time walking the common parts of the trees and jump right
     to the differences.

  2. Calculating reachability for packing[1] spends a lot of time in
     lookup_object, as we have to go through each tree saying "have we
     seen object 1234abcd yet?". If we could instead view just the
     differences, we would not have to make those hash lookups for
     entries whose objects we know we have seen.

The conclusion I came to was "no, you cannot do it in the general case
of byte-wise binary diffs"[2]. Conceptually, it's not too difficult.
For example, if we know we have copied up to byte X from the source
tree and then skip to byte Y, we know that the bytes between X and Y
did not make it, and are a deletion of entries. If those fall on
whole-entry boundaries, it's pretty simple. But of course we have no
guarantee that they do in a byte-wise diff. E.g., if entry "foo"
changes its sha1, then we will generally copy up to and including the
"foo\0" from the source, and then insert our own new sha1. Which means
we need to walk backwards from point Y to find the filename of the
entry we are in. Or worse, the sha1s might share one or more leading
bytes, and those would be part of the copy, too.

So in short, we may end up anywhere within a tree object and need to
walk backwards to the start of the entry. But we cannot walk backwards
looking for a NUL, because it may also be part of a sha1. You can only
reliably parse a git tree by starting at the beginning of the stream
and going forwards[3].

If we knew that our deltas were always produced on entry boundaries (a
"character" in your description above), this would be much simpler.

Maybe this is all stuff you already thought through, but if not, food
for thought. :)

-Peff

[1] Of course there are other reachability checks besides packing, like
    git-prune. But all of those are even better sped up by using
    reachability bitmaps. Packing, as the place where we generate the
    bitmaps, and which is sensitive to things like traversal order,
    remains the place where we will always need to actually walk.

[2] One option, of course, is to generate byte-wise deltas, but with a
    promise to always align them on entry boundaries. I'm tempted by
    this, because the result would be readable by existing packv2
    readers. We'd have to set a flag somewhere that indicates the pack
    was written with this property, though.

[3] I suspect you could come up with some heuristic that finds tree
    entry boundaries with high probability, and in the low-probability
    case does not produce a wrong answer, but instead has to walk all
    the way back to the beginning of the tree. That would be fine here.
    But frankly, this "walk backwards" thing was just the straw that
    broke the camel's back for me in working on this. Handling all the
    possible cases was ending up quite complicated.
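
To illustrate what the O(size of diff) walk would look like under the
entry-aligned promise of footnote [2]: struct delta_op, emit_del() and
emit_add() below are invented for this sketch, and copy offsets are
assumed to increase monotonically, as an entry-aligned encoder would
naturally produce.

    #include <stddef.h>
    #include <stdio.h>

    /* One decoded delta instruction: either "copy len bytes from the
       base at off" or "insert len literal bytes from data". */
    struct delta_op {
            int is_copy;
            size_t off, len;
            const unsigned char *data;
    };

    static void emit_del(const unsigned char *p, size_t len)
    {
            (void)p;  /* a real caller would parse the deleted entries */
            printf("deleted %zu bytes worth of entries\n", len);
    }

    static void emit_add(const unsigned char *p, size_t len)
    {
            (void)p;  /* a real caller would parse the added entries */
            printf("added %zu bytes worth of entries\n", len);
    }

    static void tree_diff_from_delta(const unsigned char *base,
                                     size_t base_len,
                                     const struct delta_op *op, int nr)
    {
            size_t done = 0;  /* bytes of base accounted for so far */
            int i;

            for (i = 0; i < nr; i++) {
                    if (op[i].is_copy) {
                            /* whatever the delta skipped in the base
                               is a run of deleted entries */
                            if (op[i].off > done)
                                    emit_del(base + done,
                                             op[i].off - done);
                            done = op[i].off + op[i].len;
                    } else {
                            /* literal data is a run of added entries;
                               a modified entry shows up as one delete
                               plus one add */
                            emit_add(op[i].data, op[i].len);
                    }
            }
            if (done < base_len)
                    emit_del(base + done, base_len - done);
    }

The cost is proportional to the number of delta instructions plus the
inserted data, never to the size of the whole tree.
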
* Re: Pack v4 again..
From: Junio C Hamano @ 2015-02-16 7:12 UTC
To: Jeff King; +Cc: Nicolas Pitre, Duy Nguyen, Git Mailing List

Jeff King <peff@peff.net> writes:

> 2. Calculating reachability for packing[1] spends a lot of time in
>    lookup_object, as we have to go through each tree saying "have we
>    seen object 1234abcd yet?". If we could instead view just the
>    differences, we would not have to make those hash lookups for
>    entries whose objects we know we have seen.

Hmm, that is an interesting idea.

> So in short, we may end up anywhere within a tree object and need to
> walk backwards to the start of the entry. But we cannot walk
> backwards looking for a NUL, because it may also be part of a sha1.
> You can only reliably parse a git tree by starting at the beginning
> of the stream and going forwards[3].

That is true in general even for a non-deltified tree object (you
cannot take advantage of the fact that entries are sorted by pathname
to bisect your way directly to an entry).

> If we knew that our deltas were always produced on entry boundaries
> (a "character" in your description above), this would be much
> simpler.

;-)

> [2] One option, of course, is to generate byte-wise deltas, but with
>     a promise to always align them on entry boundaries. I'm tempted
>     by this, because the result would be readable by existing packv2
>     readers. We'd have to set a flag somewhere that indicates the
>     pack was written with this property, though.

Yes, that may work.
* Re: Pack v4 again..
From: Duy Nguyen @ 2015-02-16 10:48 UTC
To: Jeff King; +Cc: Nicolas Pitre, Git Mailing List

On Mon, Feb 16, 2015 at 1:45 PM, Jeff King <peff@peff.net> wrote:
> Somewhat related to this, I was playing this weekend with the idea of
> generating fast tree diffs from our on-disk deltas. That is, given a
> base tree and a binary delta against it, could I reliably reproduce a
> diff (one way or the other) in O(size of diff), rather than O(size of
> tree)?

If you add a "delta" base cache for v4 trees to avoid the recursion
issue Nico mentioned, you effectively have a "diff" that aligns at tree
entry boundaries. The v4 tree encoding becomes just another delta
format from this perspective. I'm very tempted to just go with this to
take advantage of v4's SHA-1, ident and path encodings..
--
Duy
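
A minimal illustration of the kind of base cache meant here, assuming a
direct-mapped table keyed by pack offset (the delta base cache Git
already has for v2 deltas is more elaborate; everything below is made
up for the sketch): once a base tree has been expanded, later copies
out of it can start from the cached, entry-aligned result instead of
recursing.

    #include <stdint.h>
    #include <stdlib.h>

    #define CACHE_SLOTS 256

    struct cached_tree {
            uint64_t pack_ofs;       /* key: pack offset of the base */
            unsigned char *entries;  /* expanded, entry-aligned data */
            size_t size;
    };

    static struct cached_tree slot[CACHE_SLOTS];

    /* Return the cached expansion of the tree at pack_ofs, or NULL. */
    static struct cached_tree *tree_cache_get(uint64_t pack_ofs)
    {
            struct cached_tree *c = &slot[pack_ofs % CACHE_SLOTS];
            return (c->entries && c->pack_ofs == pack_ofs) ? c : NULL;
    }

    /* Store an expansion, evicting whatever shared the same slot. */
    static void tree_cache_put(uint64_t pack_ofs,
                               unsigned char *entries, size_t size)
    {
            struct cached_tree *c = &slot[pack_ofs % CACHE_SLOTS];
            free(c->entries);
            c->pack_ofs = pack_ofs;
            c->entries = entries;  /* takes ownership */
            c->size = size;
    }
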
* Re: Pack v4 again..
From: Shawn Pearce @ 2015-02-17 4:12 UTC
To: Jeff King; +Cc: Nicolas Pitre, Duy Nguyen, Git Mailing List

On Sun, Feb 15, 2015 at 10:45 PM, Jeff King <peff@peff.net> wrote:
> On Sun, Feb 15, 2015 at 11:59:02PM -0500, Nicolas Pitre wrote:
>
>> Yet I think the biggest problem with pack v4 at the moment is the
>> packing algorithm for tree objects. We are piggy-backing on the pack
>> v2 object delta compression sorting, and that produces suboptimal
>> results due to deep recursions. And it is the recursion that kills
>> us. Pack v4 requires a new packing algorithm for its tree objects.
>>
>> What I imagined is something like this:
>>
>> - Each canonical tree entry is made of a SHA1, mode and path. Let's
>>   assume this is hashed into a 24-bit value.
>>
>> - Each tree object can therefore be represented as a string of
>>   24-bit "characters".
>>
>> - Delta-compressing a tree object becomes a substring search where
>>   we try to replace a sequence of "characters" with the longest
>>   "string" possible from another object. Repeat with the remaining
>>   sequences.
>
> Somewhat related to this, I was playing this weekend with the idea of
> generating fast tree diffs from our on-disk deltas. That is, given a
> base tree and a binary delta against it, could I reliably reproduce a
> diff (one way or the other) in O(size of diff), rather than O(size of
> tree)?

Yes, if you always make the tree diff *search* operate on entry
boundaries.

> The conclusion I came to was "no, you cannot do it in the general
> case of byte-wise binary diffs"[2].

This is also why you cannot binary search inside of the canonical tree
format. :(

> If we knew that our deltas were always produced on entry boundaries
> (a "character" in your description above), this would be much
> simpler.

Eons ago, Nico and I were of the opinion that pack v4 trees could use
the existing byte-based delta format on disk, but with the delta
search/encoder always aligning to fixed-width entry boundaries. That
gives you deltas that are understandable by the current decoder, but
are also trivially processed in delta form, since insertions and copy
runs always cover complete entries and never a partial entry. It was
all theory; we never actually wrote a prototype of that.

> [1] Of course there are other reachability checks besides packing,
>     like git-prune. But all of those are even better sped up by using
>     reachability bitmaps. Packing, as the place where we generate the
>     bitmaps, and which is sensitive to things like traversal order,
>     remains the place where we will always need to actually walk.

`git log -- $path` isn't trivially improved with reachability bitmaps.
It's something we have been pondering a lot at $DAY_JOB and haven't
found a magic-bullet solution for yet. Until someone comes up with
another chunk of magic, we need tree diffs for a lot of operations.

> [2] One option, of course, is to generate byte-wise deltas, but with
>     a promise to always align them on entry boundaries. I'm tempted
>     by this, because the result would be readable by existing packv2
>     readers. We'd have to set a flag somewhere that indicates the
>     pack was written with this property, though.

Yes, that was always something we wanted to look at doing.

> [3] I suspect you could come up with some heuristic that finds tree
>     entry boundaries with high probability, and in the
>     low-probability case does not produce a wrong answer, but instead
>     has to walk all the way back to the beginning of the tree. That
>     would be fine here. But frankly, this "walk backwards" thing was
>     just the straw that broke the camel's back for me in working on
>     this. Handling all the possible cases was ending up quite
>     complicated.

No, I tried this in JGit once. You can't do it reliably enough.
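
A sketch of how a reader could verify the entry-aligned property being
discussed, using one forward parse of the canonical tree format (per
entry: octal mode, space, path, NUL, then the 20-byte binary sha1);
check_aligned() is invented for this example and assumes a well-formed
tree:

    #include <stddef.h>

    /* Return 1 if [off, off + len) starts and ends on entry
       boundaries of the canonical base tree. One forward parse is the
       only reliable way to find the boundaries, per the discussion
       above. */
    static int check_aligned(const unsigned char *tree, size_t size,
                             size_t off, size_t len)
    {
            size_t pos = 0, end = off + len;
            int start_ok = (off == 0);
            int end_ok = (end == 0);

            while (pos < size) {
                    /* skip "<mode> <path>", then the NUL and sha1 */
                    while (pos < size && tree[pos] != '\0')
                            pos++;
                    pos += 1 + 20;
                    if (pos == off)
                            start_ok = 1;
                    if (pos == end)
                            end_ok = 1;
            }
            return start_ok && end_ok;
    }

An encoder making the alignment promise would only ever emit copy ops
for which such a check holds, so existing packv2 readers decode the
delta as usual while aware readers can walk it entry by entry.
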
* Re: Pack v4 again..
From: Duy Nguyen @ 2015-02-16 10:11 UTC
To: Nicolas Pitre; +Cc: Git Mailing List

On Mon, Feb 16, 2015 at 11:59 AM, Nicolas Pitre <nico@fluxnic.net> wrote:
>> I think pack v4 does not deliver on its best promise: that walking a
>> tree is simply following pointers and jumping from place to place.
>> When we want to copy from the middle of another tree, we still need
>> to scan from the beginning of that tree. The tree offset cache
>> helps, but the problem remains. What do you think about an
>> alternative format in which each "copy" instruction includes both
>> the index of the tree entry to copy from (i.e. what we store now)
>> _and_ the byte offset of that entry from the beginning of the tree?
>> With this byte offset, we know exactly where to start copying
>> without scanning from the beginning. It will be a bit(?) bigger, but
>> it's also faster.
>
> That would make the format inflexible. If we want to do partial
> repacking by, say, copying some objects and repacking others (some
> objects might need repacking because the objects they refer to are
> omitted from the repack), and those repacked objects are themselves
> referred to by byte offset, then we lose: the offset is no longer
> valid.

When generating new packs, we could rip these offsets out, but that
depends on whether we can reuse v4 trees unchanged like we do with v2
deltas. If we can, ripping them out is not an option, because then we
would have to parse trees that we could otherwise copy verbatim.

>> I imagine this is an optimization that can be done locally. The pack
>> transferred over the network does not have these byte offsets. After
>> the pack is stored and verified by index-pack, we can rewrite it and
>> add this info. The simplest way is to use a fixed size for this
>> offset (e.g. uint16_t or even uint8_t) and add the placeholder to
>> the copy instructions of all v4 trees. After that, object offsets
>> will not change again, and we can fill the real offsets into the
>> placeholders.
>
> Having a local extra index is fine, just like the pack index, which
> is always created locally and can be recreated at any time. Some tree
> offset cache might be beneficial, but I'd avoid baking it into the
> pack file itself.

Hm.. right.

> Yet I think the biggest problem with pack v4 at the moment is the
> packing algorithm for tree objects. We are piggy-backing on the pack
> v2 object delta compression sorting, and that produces suboptimal
> results due to deep recursions. And it is the recursion that kills
> us. Pack v4 requires a new packing algorithm for its tree objects.

Yep. I made a conversion tool a few days ago to "flatten" v4 trees: if
tree A copies some entries from B, but B in turn copies from C, tree A
could copy directly from C. Performance improves significantly (close
to v2 with rev-list, but still slower), but pack size doubles, because
copy sequences become fragmented and we duplicate the same copy
patterns over and over again. All because we follow the single delta
chain decided back in v2 times, so there is only one tree with no copy
sequences (the best one to copy from).

> What I imagined is something like this:
>
> - Each canonical tree entry is made of a SHA1, mode and path. Let's
>   assume this is hashed into a 24-bit value.
>
> - Each tree object can therefore be represented as a string of 24-bit
>   "characters".
>
> - Delta-compressing a tree object becomes a substring search where we
>   try to replace a sequence of "characters" with the longest "string"
>   possible from another object. Repeat with the remaining sequences.
>
> Having a 24-bit hash value is totally arbitrary. It could be 16 bits,
> with more collisions but much faster searches and less memory usage.
> The optimal value would need to be determined by some
> experimentation.
>
> Algorithms for the longest common substring problem already exist, so
> one of the classical algorithms could probably be adapted here.
>
> This would allow for exploiting the provision in pack v4 to copy from
> more than one tree object. And it would also favor shallower
> recursions and even smaller packs. Imposing a minimum substring
> length (rather than a maximum delta depth) would determine the
> runtime performance when using the pack afterwards.
>
> If you have enough free cycles to work on this, that's what I'd
> suggest you explore at this point. I wish I could myself, as I think
> this ought to be rather cool work.

I'm getting there. I'm writing an alternative implementation of your
pv4_encode_tree() that takes multiple base trees instead of just one,
finds all copy sequences from those bases, then somehow picks the best
ones. After trees are sorted by similarity in pack-objects, we preserve
n trees as base trees (with no copy sequences). There is a window to
feed some of these base trees to this encode function, like we do in
try_delta(). Encoding tree entries as strings, as you describe, would
be faster, but that's not the immediate problem.

Using the longest common substring is kinda greedy (exactly what I was
thinking of doing), but it probably produces suboptimal copy sequences.
Maybe using two shorter copy sequences would cause less fragmentation
than one large copy sequence plus a bunch of individual tree entries,
that sort of thing. Haven't worked it out yet..
--
Duy
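
For reference, the greedy multi-base search being described, in its
most naive form (illustrative only: a real implementation would index
the base strings instead of scanning them, and would verify candidate
matches against the real entries, since the 24-bit hashes can collide):

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    #define MIN_COPY 3  /* minimum substring length, per Nico above */

    struct tree_str {
            const uint32_t *h;  /* the tree as 24-bit entry hashes */
            size_t len;
    };

    static size_t match_len(const uint32_t *a, size_t alen,
                            const uint32_t *b, size_t blen)
    {
            size_t n = 0;
            while (n < alen && n < blen && a[n] == b[n])
                    n++;
            return n;
    }

    static void encode_tree(const struct tree_str *target,
                            const struct tree_str *base, int nr_bases)
    {
            size_t pos = 0;

            while (pos < target->len) {
                    size_t best = 0, best_off = 0;
                    int best_base = -1, b;

                    /* leftmost-longest match across all bases */
                    for (b = 0; b < nr_bases; b++) {
                            size_t off;
                            for (off = 0; off < base[b].len; off++) {
                                    size_t n = match_len(
                                            target->h + pos,
                                            target->len - pos,
                                            base[b].h + off,
                                            base[b].len - off);
                                    if (n > best) {
                                            best = n;
                                            best_off = off;
                                            best_base = b;
                                    }
                            }
                    }
                    if (best >= MIN_COPY) {
                            printf("copy from base %d, entries %zu..%zu\n",
                                   best_base, best_off,
                                   best_off + best - 1);
                            pos += best;
                    } else {
                            printf("literal entry %zu\n", pos);
                            pos++;
                    }
            }
    }

The leftmost-longest choice is exactly the greedy strategy Duy worries
about: it can fragment what would otherwise be two medium copy runs, so
a smarter cost function may well beat it.
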
Thread overview: 7+ messages

2015-02-13 10:46 Pack v4 again.. Duy Nguyen
2015-02-16  4:59 ` Nicolas Pitre
2015-02-16  6:45   ` Jeff King
2015-02-16  7:12     ` Junio C Hamano
2015-02-16 10:48     ` Duy Nguyen
2015-02-17  4:12     ` Shawn Pearce
2015-02-16 10:11   ` Duy Nguyen