git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Why do base objects appear behind the delta in packs?
@ 2006-08-29 13:42 Shawn Pearce
  2006-08-29 14:40 ` Nicolas Pitre
  2006-08-29 14:58 ` Jakub Narebski
  0 siblings, 2 replies; 9+ messages in thread
From: Shawn Pearce @ 2006-08-29 13:42 UTC (permalink / raw)
  To: git

Sorry but this really is a pretty stupid question on my part:

In builtin-pack-objects.c write_one(), why is the base object written
behind the first delta that depends on it (if it hasn't been written
already) rather than BEFORE the first delta that depends on it?

If the base always had to appear before any delta that uses it then
unpack-objects wouldn't need to cache a delta in memory waiting
for the base to get unpacked.

>From a data locality perspective putting the base object before
or after the delta shouldn't matter, as either way the delta
is useless without the base.  So placing the base immediately
before the delta should perform just as well as placing it after.
Either way the OS should have the base in cache by the time the
delta is being accessed.

In other words, why not apply this patch and make it a requirement
of the pack file format?


diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
index 46f524d..5dd97b9 100644
--- a/builtin-pack-objects.c
+++ b/builtin-pack-objects.c
@@ -341,11 +341,11 @@ static unsigned long write_one(struct sh
 		 * if it is written already.
 		 */
 		return offset;
-	e->offset = offset;
-	offset += write_object(f, e);
 	/* if we are deltified, write out its base object. */
 	if (e->delta)
 		offset = write_one(f, e->delta, offset);
+	e->offset = offset;
+	offset += write_object(f, e);
 	return offset;
 }
 

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: Why do base objects appear behind the delta in packs?
  2006-08-29 13:42 Why do base objects appear behind the delta in packs? Shawn Pearce
@ 2006-08-29 14:40 ` Nicolas Pitre
  2006-08-29 14:58 ` Jakub Narebski
  1 sibling, 0 replies; 9+ messages in thread
From: Nicolas Pitre @ 2006-08-29 14:40 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git

On Tue, 29 Aug 2006, Shawn Pearce wrote:

> Sorry but this really is a pretty stupid question on my part:
> 
> In builtin-pack-objects.c write_one(), why is the base object written
> behind the first delta that depends on it (if it hasn't been written
> already) rather than BEFORE the first delta that depends on it?

Most of the time the base object will have been written already since we 
favor backward deltas, and newer objects are written first.  But that 
might not always be the case.

> If the base always had to appear before any delta that uses it then
> unpack-objects wouldn't need to cache a delta in memory waiting
> for the base to get unpacked.

Like mentioned above this is not the common case.  And deltas are small 
anyway.  And when you think about it the delta and base objects have to 
be both in memory so this doesn't change anything in the end.  So in 
practice there is no really special caching.

> >From a data locality perspective putting the base object before
> or after the delta shouldn't matter, as either way the delta
> is useless without the base.  So placing the base immediately
> before the delta should perform just as well as placing it after.
> Either way the OS should have the base in cache by the time the
> delta is being accessed.

Not necessarily.  In fact if you checkout a particular revision with 
such a case, putting the base first will force a seek over it since what 
is referenced first is the delta, then seek back to the base.  If it is 
right after the delta then there is no need to seek back and some read 
ahead might have picked the base by the time it is referenced.  OK since 
all this is mmap()'ed there might not be much difference in practice, 
but in theory the current arrangement could have a slight advantage.

> In other words, why not apply this patch and make it a requirement
> of the pack file format?

I don't think this should be a requirement.


Nicolas

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Why do base objects appear behind the delta in packs?
  2006-08-29 13:42 Why do base objects appear behind the delta in packs? Shawn Pearce
  2006-08-29 14:40 ` Nicolas Pitre
@ 2006-08-29 14:58 ` Jakub Narebski
  2006-08-29 16:27   ` Shawn Pearce
  1 sibling, 1 reply; 9+ messages in thread
From: Jakub Narebski @ 2006-08-29 14:58 UTC (permalink / raw)
  To: git

Shawn Pearce wrote:

> From a data locality perspective putting the base object before
> or after the delta shouldn't matter, as either way the delta
> is useless without the base.  So placing the base immediately
> before the delta should perform just as well as placing it after.
> Either way the OS should have the base in cache by the time the
> delta is being accessed.

_Should_ perform? Have you got any measurements of speed of creating "base
before delta" pack, and reading objects from this kind of pack?

-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Why do base objects appear behind the delta in packs?
  2006-08-29 14:58 ` Jakub Narebski
@ 2006-08-29 16:27   ` Shawn Pearce
  2006-08-29 17:34     ` Junio C Hamano
  0 siblings, 1 reply; 9+ messages in thread
From: Shawn Pearce @ 2006-08-29 16:27 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git

Jakub Narebski <jnareb@gmail.com> wrote:
> Shawn Pearce wrote:
> 
> > From a data locality perspective putting the base object before
> > or after the delta shouldn't matter, as either way the delta
> > is useless without the base.  So placing the base immediately
> > before the delta should perform just as well as placing it after.
> > Either way the OS should have the base in cache by the time the
> > delta is being accessed.
> 
> _Should_ perform? Have you got any measurements of speed of creating "base
> before delta" pack, and reading objects from this kind of pack?

No, not yet.  It just seemed odd to me that the base was put behind
the delta which then forces unpack-objects to hold a delta in memory
until it finds the corresponding base later in the stream when it
could have been just as simple to require the base appear before
the delta.  I wondered what the rationale was for the additional
complexity in unpack-objects.

Nicolas' reply pointed out that the current arrangement of base
after delta may actually offer improved performance due to the
OS performing read-ahead when you seek to the delta.  But he also
pointed out this base after delta situtation should be rather rare
as we try to delta older objects against newer objects and we try to
place newer objects at the front of the pack, so it likely shouldn't
matter that much.


I just instrumented builtin-pack-objects.c to count how many times
we put the delta before the base and then repacked a current Git
repo with `git repack -a -d -f`.  28167 objects, 19170 deltas. 6003
deltas appeared before their base objects.  So 31% of the time.
That's certainly not the common case but it does occur with some
frequency.  However resorting the output of verify-pack -v by offset
and visually looking at the entries you can clearly see it doesn't
happen very often early in the pack. Most of the objects in the
front of the pack are undeltafied commits.

This particular Git repository has 6723 commits and 905 trees that
weren't deltafied.  That's a total of 4 MiB of uncompressed data,
most of which appears at the front of the pack.  Only 68 commits
were deltas but 8067 trees were made into deltas.  The compressed
commits seemed to occupy the first 2 MiB of the pack file; that's
25% of the 8 MiB pack.  A commit-specific pack local dictionary
could be interesting here as it might some pack space.


I'm going to shutup now and not say anything further on the subject
unless I've got some hard results indicating a different organization
is better or worse than what we have right now.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Why do base objects appear behind the delta in packs?
  2006-08-29 16:27   ` Shawn Pearce
@ 2006-08-29 17:34     ` Junio C Hamano
  2006-08-29 17:44       ` Shawn Pearce
  0 siblings, 1 reply; 9+ messages in thread
From: Junio C Hamano @ 2006-08-29 17:34 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git

Shawn Pearce <spearce@spearce.org> writes:

>> Shawn Pearce wrote:
>> 
>> > From a data locality perspective putting the base object before
>> > or after the delta shouldn't matter, as either way the delta
>> > is useless without the base.  So placing the base immediately
>> > before the delta should perform just as well as placing it after.
>> > Either way the OS should have the base in cache by the time the
>> > delta is being accessed.
>... 
> I'm going to shutup now and not say anything further on the subject
> unless I've got some hard results indicating a different organization
> is better or worse than what we have right now.

I think that may be a sensible thing to do (no sarcasm -- I
think this measurement is long overdue).

The code was initially proposed just like you suggested but is
in the current form precisely for the reason of avoiding
back-seek.  I distinctly remember me asking Linus "does mmap()
favor forward scan by doing readahead?  I thought its point was
to allow random access" (the answer is "yes" and "yes but
forward is the common case").

The pack-using side in sha1_file.c used to read deltified object
(both header and delta) in full, pick up and read base, and
apply delta to base.  This was thought to be memory hungry on a
longer delta chain, so the current code reads only the header of
a deltified object, reads base, then reads the delta to apply.
The last step involves seeking back, and might make the
back-seek avoidance less effective than before.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Why do base objects appear behind the delta in packs?
  2006-08-29 17:34     ` Junio C Hamano
@ 2006-08-29 17:44       ` Shawn Pearce
  2006-08-29 18:16         ` Nicolas Pitre
  0 siblings, 1 reply; 9+ messages in thread
From: Shawn Pearce @ 2006-08-29 17:44 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Junio C Hamano <junkio@cox.net> wrote:
> Shawn Pearce <spearce@spearce.org> writes:
> 
> >> Shawn Pearce wrote:
> >> 
> >> > From a data locality perspective putting the base object before
> >> > or after the delta shouldn't matter, as either way the delta
> >> > is useless without the base.  So placing the base immediately
> >> > before the delta should perform just as well as placing it after.
> >> > Either way the OS should have the base in cache by the time the
> >> > delta is being accessed.
> >... 
> > I'm going to shutup now and not say anything further on the subject
> > unless I've got some hard results indicating a different organization
> > is better or worse than what we have right now.
> 
> I think that may be a sensible thing to do (no sarcasm -- I
> think this measurement is long overdue).
> 
> The code was initially proposed just like you suggested but is
> in the current form precisely for the reason of avoiding
> back-seek.  I distinctly remember me asking Linus "does mmap()
> favor forward scan by doing readahead?  I thought its point was
> to allow random access" (the answer is "yes" and "yes but
> forward is the common case").
> 
> The pack-using side in sha1_file.c used to read deltified object
> (both header and delta) in full, pick up and read base, and
> apply delta to base.  This was thought to be memory hungry on a
> longer delta chain, so the current code reads only the header of
> a deltified object, reads base, then reads the delta to apply.
> The last step involves seeking back, and might make the
> back-seek avoidance less effective than before.

Thank you.  That was the sort of response I was looking for.  :-)

I know Jon wants to shrink that ~500 MB Mozilla pack to something
a lot smaller, and I'd like to help him do that without losing huge
amounts of performance on the read.  Very long delta chains (5000!)
are simply impossible to wade through for even one object; doing it
for an entire commit to checkout the files is something I wouldn't
want to wish on anyone.

So I'm probably going to wind up spending some time doing research
and experimentation on pack storage.  I may just discover we're
as good as we can get.  Or I may find that doing something else
saves us only 5% at the cost of far too much complexity and thus
isn't really worth doing.  Or I may get lucky and discover a way
to improve on what we have.

More on this thread (maybe) in a few months.  I have other stuff
I should be doing right now.  :)

-- 
Shawn.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Why do base objects appear behind the delta in packs?
  2006-08-29 17:44       ` Shawn Pearce
@ 2006-08-29 18:16         ` Nicolas Pitre
  2006-08-29 18:32           ` Shawn Pearce
  0 siblings, 1 reply; 9+ messages in thread
From: Nicolas Pitre @ 2006-08-29 18:16 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git

On Tue, 29 Aug 2006, Shawn Pearce wrote:

> So I'm probably going to wind up spending some time doing research
> and experimentation on pack storage.  I may just discover we're
> as good as we can get.  Or I may find that doing something else
> saves us only 5% at the cost of far too much complexity and thus
> isn't really worth doing.  Or I may get lucky and discover a way
> to improve on what we have.

I'm periodically looking for improvements on packing performances 
myself.

I look forward to being able to use your (and Jon's) fast-import work in 
order to play with (re)generation of big packs myself.


Nicolas

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Why do base objects appear behind the delta in packs?
  2006-08-29 18:16         ` Nicolas Pitre
@ 2006-08-29 18:32           ` Shawn Pearce
  2006-08-29 19:23             ` Jon Smirl
  0 siblings, 1 reply; 9+ messages in thread
From: Shawn Pearce @ 2006-08-29 18:32 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: git

Nicolas Pitre <nico@cam.org> wrote:
> I look forward to being able to use your (and Jon's) fast-import work in 
> order to play with (re)generation of big packs myself.

My repository is here:

  http://www.spearce.org/projects/scm/git.git

branch 'refs/heads/sp/fastpack'.

I'd appreciate it if folks didn't clone directly from me but instead
used an existing clone to pull the branch down into.  Its based on
a fairly recent 'next' branch.  Anything not available via Junio's
'next' on kernel.org is loose objects in this repository and are
specific to my fast-import work.

Documentation of the protocol used to stuff a pack is in the
header of fast-import.c.  Its a relatively trivial stream format.
It should be quite simple to generate a random large project and
pipe it into fast-import to get a resulting pack to play with.

I don't have Jon's cvs2svn code and I don't know if its ready for
public consumption yet.  git-fast-import however looks like its
almost there, so I'm making the URL publicly available for those
that may be interested in it.

[Junio: please do the lazy thing and don't pull this into Git just
 yet, I don't think this branch is ready for that, not even for pu.]

-- 
Shawn.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Why do base objects appear behind the delta in packs?
  2006-08-29 18:32           ` Shawn Pearce
@ 2006-08-29 19:23             ` Jon Smirl
  0 siblings, 0 replies; 9+ messages in thread
From: Jon Smirl @ 2006-08-29 19:23 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Nicolas Pitre, git

On 8/29/06, Shawn Pearce <spearce@spearce.org> wrote:
> I don't have Jon's cvs2svn code and I don't know if its ready for
> public consumption yet.  git-fast-import however looks like its
> almost there, so I'm making the URL publicly available for those
> that may be interested in it.

If anyone is interested in cvs2svn mods let me know (they are in
Python) and I will send you a snap shot. The current code gets all the
blobs, tree and commits into a pack via fast-import, but the trees not
being generated correctly for complex branching when there are issues
with the symbols in CVS.

-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2006-08-29 19:24 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-08-29 13:42 Why do base objects appear behind the delta in packs? Shawn Pearce
2006-08-29 14:40 ` Nicolas Pitre
2006-08-29 14:58 ` Jakub Narebski
2006-08-29 16:27   ` Shawn Pearce
2006-08-29 17:34     ` Junio C Hamano
2006-08-29 17:44       ` Shawn Pearce
2006-08-29 18:16         ` Nicolas Pitre
2006-08-29 18:32           ` Shawn Pearce
2006-08-29 19:23             ` Jon Smirl

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).