* Pack v4 again..
From: Duy Nguyen @ 2015-02-13 10:46 UTC
To: Nicolas Pitre; +Cc: Git Mailing List

After taking a 1.5-year "vacation" from pack v4, I plan to do something
about it again. Will post more when I have some patches to discuss.
Only one question for now (forgive me if I asked it already, it's been
quite some time).

I think pack v4 does not deliver on its best promise: that walking a
tree is simply following pointers and jumping from place to place. When
we want to copy from the middle of another tree, we still need to scan
from the beginning of that tree. The tree offset cache helps, but the
problem remains. What do you think about an alternative format in which
each "copy" instruction includes both the index of the tree entry to
copy from (i.e. what we store now) _and_ the byte offset of that entry
from the beginning of the tree? With this byte offset, we know exactly
where to start copying without scanning from the beginning. It will be
a bit(?) bigger, but it's also faster.

I imagine this is an optimization that can be done locally. The pack
transferred over the network does not have these byte offsets. After
the pack is stored and verified by index-pack, we can rewrite it and
add this info. The simplest way is to use a fixed size for this offset
(e.g. uint16_t or even uint8_t) and add the placeholder to the copy
instructions of all v4 trees. After that, object offsets will not
change again, and we can fill the real offsets into the placeholders.

PS. The rebased version on recent master is here if anyone is
interested: https://github.com/pclouds/git/commits/pack-v4
--
Duy
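
A rough sketch of what such a copy instruction could carry, written as
a C struct for illustration only (the real pack v4 encoding uses
variable-length integers, and the field names here are invented):

    #include <stdint.h>

    struct tree_copy {
            uint32_t base_tree;  /* which base tree to copy from */
            uint32_t start;      /* index of the first entry to copy;
                                    this is what pack v4 stores today */
            uint32_t count;      /* number of entries to copy */
            uint16_t byte_ofs;   /* local-only addition: byte offset of
                                    entry 'start' in the expanded base
                                    tree; UINT16_MAX would act as the
                                    placeholder, meaning "unknown, fall
                                    back to scanning from the top" */
    };

On the wire, byte_ofs would simply hold the placeholder value;
index-pack would overwrite it once object offsets are final.
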
* Re: Pack v4 again..
From: Nicolas Pitre @ 2015-02-16 4:59 UTC
To: Duy Nguyen; +Cc: Git Mailing List

On Fri, 13 Feb 2015, Duy Nguyen wrote:

> After taking a 1.5-year "vacation" from pack v4, I plan to do
> something about it again. Will post more when I have some patches to
> discuss. Only one question for now (forgive me if I asked it already,
> it's been quite some time).

Yeah. I had to re-study my own code before replying.

> I think pack v4 does not deliver on its best promise: that walking a
> tree is simply following pointers and jumping from place to place.
> When we want to copy from the middle of another tree, we still need
> to scan from the beginning of that tree. The tree offset cache helps,
> but the problem remains. What do you think about an alternative
> format in which each "copy" instruction includes both the index of
> the tree entry to copy from (i.e. what we store now) _and_ the byte
> offset of that entry from the beginning of the tree? With this byte
> offset, we know exactly where to start copying without scanning from
> the beginning. It will be a bit(?) bigger, but it's also faster.

That would make the format inflexible. If we want to do partial
repacking by, say, copying some objects and repacking others (some
objects might need repacking because the objects they refer to are
omitted from the repack), and those repacked objects are themselves
referred to by byte offset, then we lose: the offset is no longer
valid.

> I imagine this is an optimization that can be done locally. The pack
> transferred over the network does not have these byte offsets. After
> the pack is stored and verified by index-pack, we can rewrite it and
> add this info. The simplest way is to use a fixed size for this
> offset (e.g. uint16_t or even uint8_t) and add the placeholder to the
> copy instructions of all v4 trees. After that, object offsets will
> not change again, and we can fill the real offsets into the
> placeholders.

Having a local extra index is fine, just like the pack index, which is
always created locally and can be recreated at any time. Some tree
offset cache might be beneficial, but I'd avoid baking it into the pack
file itself.

Yet I think the biggest problem with pack v4 at the moment is the
packing algorithm for tree objects. We are piggy-backing on the pack v2
object delta compression sorting, and that produces suboptimal results
due to deep recursions. And it is the recursion that kills us. Pack v4
requires a new packing algorithm for its tree objects.

What I imagined is something like this:

- Each canonical tree entry is made of a SHA1, mode and path. Let's
  assume this is hashed into a 24-bit value.

- Each tree object can therefore be represented as a string of 24-bit
  "characters".

- Delta-compressing a tree object becomes a substring search where we
  try to replace a sequence of "characters" with the longest "string"
  possible from another object. Repeat with the remaining sequences.

Having a 24-bit hash value is totally arbitrary. It could be 16 bits,
with more collisions but much faster searches and less memory usage.
The optimal value would need to be determined by some experimentation.

Algorithms for the longest common substring problem already exist, so
one of the classical algorithms could probably be adapted here.

This would allow for exploiting the provision in pack v4 to copy from
more than one tree object. And it would also favor shallower recursions
and even smaller packs. Imposing a minimum substring length (rather
than a maximum delta depth) would determine the runtime performance
when using the pack afterwards.

If you have enough free cycles to work on this, that's what I'd suggest
you explore at this point. I wish I could myself, as I think this ought
to be rather cool work.

Nicolas
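
To sketch the hashing step concretely (the mixing function below is
made up for illustration; collisions are acceptable because any
candidate match found via these 24-bit "characters" would still be
verified against the real entries before a copy is emitted):

    #include <stdint.h>
    #include <stddef.h>

    /* Hash one canonical tree entry (sha1, mode, path) into a 24-bit
       "character", so each tree becomes a string of such characters. */
    static uint32_t entry_hash24(const unsigned char *sha1,
                                 unsigned int mode, const char *path)
    {
            uint32_t h = 5381;
            size_t i;

            for (i = 0; i < 20; i++)
                    h = (h * 33) ^ sha1[i];
            h = (h * 33) ^ mode;
            while (*path)
                    h = (h * 33) ^ (unsigned char)*path++;
            return h & 0xffffff;    /* keep 24 bits */
    }
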
* Re: Pack v4 again..
From: Jeff King @ 2015-02-16 6:45 UTC
To: Nicolas Pitre; +Cc: Duy Nguyen, Git Mailing List

On Sun, Feb 15, 2015 at 11:59:02PM -0500, Nicolas Pitre wrote:

> Yet I think the biggest problem with pack v4 at the moment is the
> packing algorithm for tree objects. We are piggy-backing on the pack
> v2 object delta compression sorting, and that produces suboptimal
> results due to deep recursions. And it is the recursion that kills
> us. Pack v4 requires a new packing algorithm for its tree objects.
>
> What I imagined is something like this:
>
> - Each canonical tree entry is made of a SHA1, mode and path. Let's
>   assume this is hashed into a 24-bit value.
>
> - Each tree object can therefore be represented as a string of 24-bit
>   "characters".
>
> - Delta-compressing a tree object becomes a substring search where we
>   try to replace a sequence of "characters" with the longest "string"
>   possible from another object. Repeat with the remaining sequences.

Somewhat related to this, I was playing this weekend with the idea of
generating fast tree diffs from our on-disk deltas. That is, given a
base tree and a binary delta against it, could I reliably reproduce a
diff (one way or the other) in O(size of diff), rather than O(size of
tree)?

I have two applications in mind:

  1. Tree diffs for "git log -- pathspec" do the same diff (between a
     tree and its corresponding version in the parent) over and over.
     A significant percentage of the time (at least in linux.git and
     git.git), we store deltas between these trees. We could stop
     wasting time walking the common parts of the trees and jump right
     to the differences.

  2. Calculating reachability for packing[1] spends a lot of time in
     lookup_object, as we have to go through each tree saying "have we
     seen object 1234abcd yet?". If we could instead view just the
     differences, we would not have to make those hash lookups for
     entries whose objects we know we have seen.

The conclusion I came to was "no, you cannot do it in the general case
of byte-wise binary diffs"[2]. Conceptually, it's not too difficult.
For example, if we know we have copied up to byte X from the source
tree and then skip to byte Y, we know that the bytes between X and Y
did not make it, and are a deletion of entries. If those fall on
whole-entry boundaries, it's pretty simple. But of course we have no
guarantee that they do in a byte-wise diff. E.g., if entry "foo"
changes its sha1, then we will generally copy up to and including the
"foo\0" from the source, and then insert our own new sha1. Which means
we need to walk backwards from point Y to find the filename of the
entry we are in. Or worse, the sha1s might share one or more leading
bytes, and those would be part of the copy, too.

So in short, we may end up anywhere within a tree object and need to
walk backwards to the start of the entry. But we cannot walk backwards
looking for a NUL, because it may also be part of a sha1. You can only
reliably parse a git tree by starting at the beginning of the stream
and going forwards[3].

If we knew that our deltas were always produced on entry boundaries (a
"character" in your description above), this would be much simpler.

Maybe this is all stuff you already thought through, but if not, food
for thought. :)

-Peff

[1] Of course there are other reachability checks besides packing, like
    git-prune. But all of those are even better sped up by using
    reachability bitmaps. Packing, as the place where we generate the
    bitmaps, and which is sensitive to things like traversal order,
    remains the place where we will always need to actually walk.

[2] One option, of course, is to generate byte-wise deltas, but with a
    promise to always align them on entry boundaries. I'm tempted by
    this, because the result would be readable by existing packv2
    readers. We'd have to set a flag somewhere that indicates the pack
    was written with this property, though.

[3] I suspect you could come up with some heuristic that finds tree
    entry boundaries with high probability, and in the low-probability
    case does not produce a wrong answer, but instead has to walk all
    the way back to the beginning of the tree. That would be fine here.
    But frankly, this "walk backwards" thing was just the straw that
    broke the camel's back for me in working on this. Handling all the
    possible cases was ending up quite complicated.
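
To illustrate what the O(size of diff) walk would look like under the
entry-aligned promise of footnote [2]: struct delta_op, emit_del() and
emit_add() below are invented for this sketch, and copy offsets are
assumed to increase monotonically, as an entry-aligned encoder would
naturally produce.

    #include <stddef.h>
    #include <stdio.h>

    /* One decoded delta instruction: either "copy len bytes from the
       base at off" or "insert len literal bytes from data". */
    struct delta_op {
            int is_copy;
            size_t off, len;
            const unsigned char *data;
    };

    static void emit_del(const unsigned char *p, size_t len)
    {
            (void)p;  /* a real caller would parse the deleted entries */
            printf("deleted %zu bytes worth of entries\n", len);
    }

    static void emit_add(const unsigned char *p, size_t len)
    {
            (void)p;  /* a real caller would parse the added entries */
            printf("added %zu bytes worth of entries\n", len);
    }

    static void tree_diff_from_delta(const unsigned char *base,
                                     size_t base_len,
                                     const struct delta_op *op, int nr)
    {
            size_t done = 0;  /* bytes of base accounted for so far */
            int i;

            for (i = 0; i < nr; i++) {
                    if (op[i].is_copy) {
                            /* whatever the delta skipped in the base
                               is a run of deleted entries */
                            if (op[i].off > done)
                                    emit_del(base + done,
                                             op[i].off - done);
                            done = op[i].off + op[i].len;
                    } else {
                            /* literal data is a run of added entries;
                               a modified entry shows up as one delete
                               plus one add */
                            emit_add(op[i].data, op[i].len);
                    }
            }
            if (done < base_len)
                    emit_del(base + done, base_len - done);
    }

The cost is proportional to the number of delta instructions plus the
inserted data, never to the size of the whole tree.
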
* Re: Pack v4 again..
From: Junio C Hamano @ 2015-02-16 7:12 UTC
To: Jeff King; +Cc: Nicolas Pitre, Duy Nguyen, Git Mailing List

Jeff King <peff@peff.net> writes:

> 2. Calculating reachability for packing[1] spends a lot of time in
>    lookup_object, as we have to go through each tree saying "have we
>    seen object 1234abcd yet?". If we could instead view just the
>    differences, we would not have to make those hash lookups for
>    entries whose objects we know we have seen.

Hmm, that is an interesting idea.

> So in short, we may end up anywhere within a tree object and need to
> walk backwards to the start of the entry. But we cannot walk
> backwards looking for a NUL, because it may also be part of a sha1.
> You can only reliably parse a git tree by starting at the beginning
> of the stream and going forwards[3].

That is true in general even for a non-deltified tree object (you
cannot take advantage of the fact that entries are sorted by pathname
to bisect your way directly to an entry).

> If we knew that our deltas were always produced on entry boundaries
> (a "character" in your description above), this would be much
> simpler.

;-)

> [2] One option, of course, is to generate byte-wise deltas, but with
>     a promise to always align them on entry boundaries. I'm tempted
>     by this, because the result would be readable by existing packv2
>     readers. We'd have to set a flag somewhere that indicates the
>     pack was written with this property, though.

Yes, that may work.
* Re: Pack v4 again..
From: Duy Nguyen @ 2015-02-16 10:48 UTC
To: Jeff King; +Cc: Nicolas Pitre, Git Mailing List

On Mon, Feb 16, 2015 at 1:45 PM, Jeff King <peff@peff.net> wrote:
> Somewhat related to this, I was playing this weekend with the idea of
> generating fast tree diffs from our on-disk deltas. That is, given a
> base tree and a binary delta against it, could I reliably reproduce a
> diff (one way or the other) in O(size of diff), rather than O(size of
> tree)?

If you add a "delta" base cache for v4 trees to avoid the recursion
issue Nico mentioned, you effectively have a "diff" that aligns at tree
entry boundaries. The v4 tree encoding becomes just another delta
format from this perspective. I'm very tempted to just go with this to
take advantage of v4's SHA-1, ident and path encodings..
--
Duy
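
A minimal illustration of the kind of base cache meant here, assuming a
direct-mapped table keyed by pack offset (the delta base cache Git
already has for v2 deltas is more elaborate; everything below is made
up for the sketch): once a base tree has been expanded, later copies
out of it can start from the cached, entry-aligned result instead of
recursing.

    #include <stdint.h>
    #include <stdlib.h>

    #define CACHE_SLOTS 256

    struct cached_tree {
            uint64_t pack_ofs;       /* key: pack offset of the base */
            unsigned char *entries;  /* expanded, entry-aligned data */
            size_t size;
    };

    static struct cached_tree slot[CACHE_SLOTS];

    /* Return the cached expansion of the tree at pack_ofs, or NULL. */
    static struct cached_tree *tree_cache_get(uint64_t pack_ofs)
    {
            struct cached_tree *c = &slot[pack_ofs % CACHE_SLOTS];
            return (c->entries && c->pack_ofs == pack_ofs) ? c : NULL;
    }

    /* Store an expansion, evicting whatever shared the same slot. */
    static void tree_cache_put(uint64_t pack_ofs,
                               unsigned char *entries, size_t size)
    {
            struct cached_tree *c = &slot[pack_ofs % CACHE_SLOTS];
            free(c->entries);
            c->pack_ofs = pack_ofs;
            c->entries = entries;  /* takes ownership */
            c->size = size;
    }
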
* Re: Pack v4 again..
From: Shawn Pearce @ 2015-02-17 4:12 UTC
To: Jeff King; +Cc: Nicolas Pitre, Duy Nguyen, Git Mailing List

On Sun, Feb 15, 2015 at 10:45 PM, Jeff King <peff@peff.net> wrote:
> On Sun, Feb 15, 2015 at 11:59:02PM -0500, Nicolas Pitre wrote:
>
>> Yet I think the biggest problem with pack v4 at the moment is the
>> packing algorithm for tree objects. We are piggy-backing on the pack
>> v2 object delta compression sorting, and that produces suboptimal
>> results due to deep recursions. And it is the recursion that kills
>> us. Pack v4 requires a new packing algorithm for its tree objects.
>>
>> What I imagined is something like this:
>>
>> - Each canonical tree entry is made of a SHA1, mode and path. Let's
>>   assume this is hashed into a 24-bit value.
>>
>> - Each tree object can therefore be represented as a string of
>>   24-bit "characters".
>>
>> - Delta-compressing a tree object becomes a substring search where
>>   we try to replace a sequence of "characters" with the longest
>>   "string" possible from another object. Repeat with the remaining
>>   sequences.
>
> Somewhat related to this, I was playing this weekend with the idea of
> generating fast tree diffs from our on-disk deltas. That is, given a
> base tree and a binary delta against it, could I reliably reproduce a
> diff (one way or the other) in O(size of diff), rather than O(size of
> tree)?

Yes, if you always make the tree diff *search* operate on entry
boundaries.

> The conclusion I came to was "no, you cannot do it in the general
> case of byte-wise binary diffs"[2].

This is also why you cannot binary search inside of the canonical tree
format. :(

> If we knew that our deltas were always produced on entry boundaries
> (a "character" in your description above), this would be much
> simpler.

Eons ago, Nico and I were of the opinion that pack v4 trees could use
the existing byte-based delta format on disk, but with the delta
search/encoder always aligning to fixed-width entry boundaries. That
gives you deltas that are understandable by the current decoder, but
are also trivially processed in delta form, since insertions and copy
runs always cover complete entries and never a partial entry. It was
all theory; we never actually wrote a prototype of that.

> [1] Of course there are other reachability checks besides packing,
>     like git-prune. But all of those are even better sped up by using
>     reachability bitmaps. Packing, as the place where we generate the
>     bitmaps, and which is sensitive to things like traversal order,
>     remains the place where we will always need to actually walk.

`git log -- $path` isn't trivially improved with reachability bitmaps.
It's something we have been pondering a lot at $DAY_JOB and haven't
found a magic-bullet solution for yet. Until someone comes up with
another chunk of magic, we need tree diffs for a lot of operations.

> [2] One option, of course, is to generate byte-wise deltas, but with
>     a promise to always align them on entry boundaries. I'm tempted
>     by this, because the result would be readable by existing packv2
>     readers. We'd have to set a flag somewhere that indicates the
>     pack was written with this property, though.

Yes, that was always something we wanted to look at doing.

> [3] I suspect you could come up with some heuristic that finds tree
>     entry boundaries with high probability, and in the
>     low-probability case does not produce a wrong answer, but instead
>     has to walk all the way back to the beginning of the tree. That
>     would be fine here. But frankly, this "walk backwards" thing was
>     just the straw that broke the camel's back for me in working on
>     this. Handling all the possible cases was ending up quite
>     complicated.

No, I tried this in JGit once. You can't do it reliably enough.
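
A sketch of how a reader could verify the entry-aligned property being
discussed, using one forward parse of the canonical tree format (per
entry: octal mode, space, path, NUL, then the 20-byte binary sha1);
check_aligned() is invented for this example and assumes a well-formed
tree:

    #include <stddef.h>

    /* Return 1 if [off, off + len) starts and ends on entry
       boundaries of the canonical base tree. One forward parse is the
       only reliable way to find the boundaries, per the discussion
       above. */
    static int check_aligned(const unsigned char *tree, size_t size,
                             size_t off, size_t len)
    {
            size_t pos = 0, end = off + len;
            int start_ok = (off == 0);
            int end_ok = (end == 0);

            while (pos < size) {
                    /* skip "<mode> <path>", then the NUL and sha1 */
                    while (pos < size && tree[pos] != '\0')
                            pos++;
                    pos += 1 + 20;
                    if (pos == off)
                            start_ok = 1;
                    if (pos == end)
                            end_ok = 1;
            }
            return start_ok && end_ok;
    }

An encoder making the alignment promise would only ever emit copy ops
for which such a check holds, so existing packv2 readers decode the
delta as usual while aware readers can walk it entry by entry.
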
* Re: Pack v4 again..
From: Duy Nguyen @ 2015-02-16 10:11 UTC
To: Nicolas Pitre; +Cc: Git Mailing List

On Mon, Feb 16, 2015 at 11:59 AM, Nicolas Pitre <nico@fluxnic.net> wrote:
>> I think pack v4 does not deliver on its best promise: that walking a
>> tree is simply following pointers and jumping from place to place.
>> When we want to copy from the middle of another tree, we still need
>> to scan from the beginning of that tree. The tree offset cache
>> helps, but the problem remains. What do you think about an
>> alternative format in which each "copy" instruction includes both
>> the index of the tree entry to copy from (i.e. what we store now)
>> _and_ the byte offset of that entry from the beginning of the tree?
>> With this byte offset, we know exactly where to start copying
>> without scanning from the beginning. It will be a bit(?) bigger, but
>> it's also faster.
>
> That would make the format inflexible. If we want to do partial
> repacking by, say, copying some objects and repacking others (some
> objects might need repacking because the objects they refer to are
> omitted from the repack), and those repacked objects are themselves
> referred to by byte offset, then we lose: the offset is no longer
> valid.

When generating new packs, we could rip these offsets out, but that
depends on whether we can reuse v4 trees unchanged like we do with v2
deltas. If we can, ripping them out is not an option, because then we
would have to parse trees that we could otherwise copy verbatim.

>> I imagine this is an optimization that can be done locally. The pack
>> transferred over the network does not have these byte offsets. After
>> the pack is stored and verified by index-pack, we can rewrite it and
>> add this info. The simplest way is to use a fixed size for this
>> offset (e.g. uint16_t or even uint8_t) and add the placeholder to
>> the copy instructions of all v4 trees. After that, object offsets
>> will not change again, and we can fill the real offsets into the
>> placeholders.
>
> Having a local extra index is fine, just like the pack index, which
> is always created locally and can be recreated at any time. Some tree
> offset cache might be beneficial, but I'd avoid baking it into the
> pack file itself.

Hm.. right.

> Yet I think the biggest problem with pack v4 at the moment is the
> packing algorithm for tree objects. We are piggy-backing on the pack
> v2 object delta compression sorting, and that produces suboptimal
> results due to deep recursions. And it is the recursion that kills
> us. Pack v4 requires a new packing algorithm for its tree objects.

Yep. I made a conversion tool a few days ago to "flatten" v4 trees: if
tree A copies some entries from B, but B in turn copies from C, tree A
could copy directly from C. Performance improves significantly (close
to v2 with rev-list, but still slower), but pack size doubles, because
copy sequences become fragmented and we duplicate the same copy
patterns over and over again. All because we follow the single delta
chain decided back in v2 times, so there is only one tree with no copy
sequences (the best one to copy from).

> What I imagined is something like this:
>
> - Each canonical tree entry is made of a SHA1, mode and path. Let's
>   assume this is hashed into a 24-bit value.
>
> - Each tree object can therefore be represented as a string of 24-bit
>   "characters".
>
> - Delta-compressing a tree object becomes a substring search where we
>   try to replace a sequence of "characters" with the longest "string"
>   possible from another object. Repeat with the remaining sequences.
>
> Having a 24-bit hash value is totally arbitrary. It could be 16 bits,
> with more collisions but much faster searches and less memory usage.
> The optimal value would need to be determined by some
> experimentation.
>
> Algorithms for the longest common substring problem already exist, so
> one of the classical algorithms could probably be adapted here.
>
> This would allow for exploiting the provision in pack v4 to copy from
> more than one tree object. And it would also favor shallower
> recursions and even smaller packs. Imposing a minimum substring
> length (rather than a maximum delta depth) would determine the
> runtime performance when using the pack afterwards.
>
> If you have enough free cycles to work on this, that's what I'd
> suggest you explore at this point. I wish I could myself, as I think
> this ought to be rather cool work.

I'm getting there. I'm writing an alternative implementation of your
pv4_encode_tree() that takes multiple base trees instead of just one,
finds all copy sequences from those bases, then somehow picks the best
ones. After trees are sorted by similarity in pack-objects, we preserve
n trees as base trees (with no copy sequences). There is a window to
feed some of these base trees to this encode function, like we do in
try_delta(). Encoding tree entries as strings, as you describe, would
be faster, but that's not the immediate problem.

Using the longest common substring is kinda greedy (exactly what I was
thinking of doing), but it probably produces suboptimal copy sequences.
Maybe using two shorter copy sequences would cause less fragmentation
than one large copy sequence plus a bunch of individual tree entries,
that sort of thing. Haven't worked it out yet..
--
Duy
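
For reference, the greedy multi-base search being described, in its
most naive form (illustrative only: a real implementation would index
the base strings instead of scanning them, and would verify candidate
matches against the real entries, since the 24-bit hashes can collide):

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    #define MIN_COPY 3  /* minimum substring length, per Nico above */

    struct tree_str {
            const uint32_t *h;  /* the tree as 24-bit entry hashes */
            size_t len;
    };

    static size_t match_len(const uint32_t *a, size_t alen,
                            const uint32_t *b, size_t blen)
    {
            size_t n = 0;
            while (n < alen && n < blen && a[n] == b[n])
                    n++;
            return n;
    }

    static void encode_tree(const struct tree_str *target,
                            const struct tree_str *base, int nr_bases)
    {
            size_t pos = 0;

            while (pos < target->len) {
                    size_t best = 0, best_off = 0;
                    int best_base = -1, b;

                    /* leftmost-longest match across all bases */
                    for (b = 0; b < nr_bases; b++) {
                            size_t off;
                            for (off = 0; off < base[b].len; off++) {
                                    size_t n = match_len(
                                            target->h + pos,
                                            target->len - pos,
                                            base[b].h + off,
                                            base[b].len - off);
                                    if (n > best) {
                                            best = n;
                                            best_off = off;
                                            best_base = b;
                                    }
                            }
                    }
                    if (best >= MIN_COPY) {
                            printf("copy from base %d, entries %zu..%zu\n",
                                   best_base, best_off,
                                   best_off + best - 1);
                            pos += best;
                    } else {
                            printf("literal entry %zu\n", pos);
                            pos++;
                    }
            }
    }

The leftmost-longest choice is exactly the greedy strategy Duy worries
about: it can fragment what would otherwise be two medium copy runs, so
a smarter cost function may well beat it.
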
Thread overview: 7+ messages

2015-02-13 10:46 Pack v4 again.. Duy Nguyen
2015-02-16  4:59 ` Nicolas Pitre
2015-02-16  6:45   ` Jeff King
2015-02-16  7:12     ` Junio C Hamano
2015-02-16 10:48     ` Duy Nguyen
2015-02-17  4:12     ` Shawn Pearce
2015-02-16 10:11   ` Duy Nguyen