[PATCH 0/5] Suggested for PU: revision caching system to significantly speed up packing/walking

* [PATCH 0/5] Suggested for PU: revision caching system to significantly speed up packing/walking
@ 2009-08-06  9:55 Nick Edelen
  2009-08-06 14:48 ` Johannes Schindelin
  0 siblings, 1 reply; 34+ messages in thread
From: Nick Edelen @ 2009-08-06  9:55 UTC (permalink / raw)
  To: Junio C Hamano, Johannes Schindelin, Jeff King, Sam Vilain,
	Shawn O. Pearce

SUGGESTED FOR 'PU':

Traversing objects is currently very costly, as every commit and tree must be 
loaded and parsed.  Much time and energy could be saved by caching metadata and 
topological info in an efficient, easily accessible manner.  Furthermore, this 
could improve git's interfacing potential, by providing a condensed summary of 
a repository's commit tree.

This is a series to implement such a revision caching mechanism, aptly named 
rev-cache.  The series will provide:
 - a core API to manipulate and traverse caches
 - an integration into the internal revision walker
 - a porcelain front-end providing access to users and (shell) applications
 - a series of tests to verify/demonstrate correctness
 - documentation of the API, porcelain and core concepts

In cold starts rev-cache has sped up packing and walking by a factor of 4, and 
over twice that on warm starts.  Some times on slax for the linux repository:

rev-list --all --objects >/dev/null
 default
   cold    1:13
   warm    0:43
 rev-cache'd
   cold    0:19
   warm    0:02

pack-objects --revs --all --stdout >/dev/null
 default
   cold    2:44
   warm    1:21
 rev-cache'd
   cold    0:44
   warm    0:10
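
The speed-up factors implied by the timings above can be checked with a little arithmetic. This is only a sketch; the helper names `to_seconds` and `speedup` are made up for illustration and are not part of the series.

```c
/* Convert a "minutes:seconds" timing from the table into seconds. */
int to_seconds(int minutes, int seconds)
{
	return minutes * 60 + seconds;
}

/* Speed-up factor: time without the cache divided by time with it. */
double speedup(int base_sec, int cached_sec)
{
	return (double)base_sec / (double)cached_sec;
}
```

Applied to the table, `speedup(to_seconds(1, 13), to_seconds(0, 19))` is about 3.8 for a cold rev-list and `speedup(to_seconds(0, 43), to_seconds(0, 2))` is 21.5 for a warm one; for pack-objects the ratios are about 3.7 (cold) and 8.1 (warm).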

The mechanism is minimally intrusive: most of the changes take place in 
separate files, and only a handful of git's existing functions are modified.

Hope you find this useful.

 - Nick

 Documentation/rev-cache.txt           |   51 +
 Documentation/technical/rev-cache.txt |  336 ++++++
 Makefile                              |    2 +
 blob.c                                |    1 +
 blob.h                                |    1 +
 builtin-rev-cache.c                   |  284 +++++
 builtin.h                             |    1 +
 commit.c                              |    3 +
 commit.h                              |    2 +
 git.c                                 |    1 +
 list-objects.c                        |   49 +-
 rev-cache.c                           | 1832 +++++++++++++++++++++++++++++++++
 revision.c                            |   89 ++-
 revision.h                            |   46 +-
 t/t6015-rev-cache-list.sh             |  228 ++++
 t/t6015-sha1-dump-diff.py             |   36 +
 16 files changed, 2937 insertions(+), 25 deletions(-)
 create mode 100755 Documentation/rev-cache.txt
 create mode 100755 Documentation/technical/rev-cache.txt
 create mode 100755 builtin-rev-cache.c
 create mode 100755 rev-cache.c
 create mode 100755 t/t6015-rev-cache-list.sh
 create mode 100755 t/t6015-sha1-dump-diff.py

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/5] Suggested for PU: revision caching system to significantly speed up packing/walking
  2009-08-06  9:55 [PATCH 0/5] Suggested for PU: revision caching system to significantly speed up packing/walking Nick Edelen
@ 2009-08-06 14:48 ` Johannes Schindelin
  2009-08-06 14:58   ` Michael J Gruber
  0 siblings, 1 reply; 34+ messages in thread
From: Johannes Schindelin @ 2009-08-06 14:48 UTC (permalink / raw)
  To: Nick Edelen
  Cc: Junio C Hamano, Jeff King, Sam Vilain, Shawn O. Pearce,
	Andreas Ericsson, Christian Couder, git@vger.kernel.org

Hi,

On Thu, 6 Aug 2009, Nick Edelen wrote:

> SUGGESTED FOR 'PU':
> 
> Traversing objects is currently very costly, as every commit and tree must be 
> loaded and parsed.  Much time and energy could be saved by caching metadata and 
> topological info in an efficient, easily accessible manner.  Furthermore, this 
> could improve git's interfacing potential, by providing a condensed summary of 
> a repository's commit tree.
> 
> This is a series to implement such a revision caching mechanism, aptly named 
> rev-cache.  The series will provide:
>  - a core API to manipulate and traverse caches
>  - an integration into the internal revision walker
>  - a porcelain front-end providing access to users and (shell) applications
>  - a series of tests to verify/demonstrate correctness
>  - documentation of the API, porcelain and core concepts
> 
> In cold starts rev-cache has sped up packing and walking by a factor of 4, and 
> over twice that on warm starts.  Some times on slax for the linux repository:
> 
> rev-list --all --objects >/dev/null
>  default
>    cold    1:13
>    warm    0:43
>  rev-cache'd
>    cold    0:19
>    warm    0:02
> 
> pack-objects --revs --all --stdout >/dev/null
>  default
>    cold    2:44
>    warm    1:21
>  rev-cache'd
>    cold    0:44
>    warm    0:10

Nice!

> The mechanism is minimally intrusive: most of the changes take place in 
> separate files, and only a handful of git's existing functions are 
> modified.

Sorry, I forgot the details, could you quickly remind me why these caches 
are not in the pack index files?

>  Documentation/rev-cache.txt           |   51 +
>  Documentation/technical/rev-cache.txt |  336 ++++++
>  Makefile                              |    2 +
>  blob.c                                |    1 +
>  blob.h                                |    1 +
>  builtin-rev-cache.c                   |  284 +++++
>  builtin.h                             |    1 +
>  commit.c                              |    3 +
>  commit.h                              |    2 +
>  git.c                                 |    1 +
>  list-objects.c                        |   49 +-
>  rev-cache.c                           | 1832 +++++++++++++++++++++++++++++++++
>  revision.c                            |   89 ++-
>  revision.h                            |   46 +-
>  t/t6015-rev-cache-list.sh             |  228 ++++
>  t/t6015-sha1-dump-diff.py             |   36 +

Hmpf.

We got rid of the last Python script in Git a long time ago, but now two 
different patch series try to sneak that dependency (at least for testing) 
back in.

That's all the worse because we cannot use Python in msysGit, and Windows 
should be a platform benefitting dramatically from your work.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/5] Suggested for PU: revision caching system to significantly speed up packing/walking
  2009-08-06 14:48 ` Johannes Schindelin
@ 2009-08-06 14:58   ` Michael J Gruber
  2009-08-06 17:39     ` Nick Edelen
  0 siblings, 1 reply; 34+ messages in thread
From: Michael J Gruber @ 2009-08-06 14:58 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Nick Edelen, Junio C Hamano, Jeff King, Sam Vilain,
	Shawn O. Pearce, Andreas Ericsson, Christian Couder,
	git@vger.kernel.org

Johannes Schindelin venit, vidit, dixit 06.08.2009 16:48:
> Hi,
> 
> On Thu, 6 Aug 2009, Nick Edelen wrote:
> 
>> SUGGESTED FOR 'PU':
>>
>> Traversing objects is currently very costly, as every commit and tree must be 
>> loaded and parsed.  Much time and energy could be saved by caching metadata and 
>> topological info in an efficient, easily accessible manner.  Furthermore, this 
>> could improve git's interfacing potential, by providing a condensed summary of 
>> a repository's commit tree.
>>
>> This is a series to implement such a revision caching mechanism, aptly named 
>> rev-cache.  The series will provide:
>>  - a core API to manipulate and traverse caches
>>  - an integration into the internal revision walker
>>  - a porcelain front-end providing access to users and (shell) applications
>>  - a series of tests to verify/demonstrate correctness
>>  - documentation of the API, porcelain and core concepts
>>
>> In cold starts rev-cache has sped up packing and walking by a factor of 4, and 
>> over twice that on warm starts.  Some times on slax for the linux repository:
>>
>> rev-list --all --objects >/dev/null
>>  default
>>    cold    1:13
>>    warm    0:43
>>  rev-cache'd
>>    cold    0:19
>>    warm    0:02
>>
>> pack-objects --revs --all --stdout >/dev/null
>>  default
>>    cold    2:44
>>    warm    1:21
>>  rev-cache'd
>>    cold    0:44
>>    warm    0:10
> 
> Nice!
> 
>> The mechanism is minimally intrusive: most of the changes take place in 
>> separate files, and only a handful of git's existing functions are 
>> modified.
> 
> Sorry, I forgot the details, could you quickly remind me why these caches 
> are not in the pack index files?
> 
>>  Documentation/rev-cache.txt           |   51 +
>>  Documentation/technical/rev-cache.txt |  336 ++++++
>>  Makefile                              |    2 +
>>  blob.c                                |    1 +
>>  blob.h                                |    1 +
>>  builtin-rev-cache.c                   |  284 +++++
>>  builtin.h                             |    1 +
>>  commit.c                              |    3 +
>>  commit.h                              |    2 +
>>  git.c                                 |    1 +
>>  list-objects.c                        |   49 +-
>>  rev-cache.c                           | 1832 +++++++++++++++++++++++++++++++++
>>  revision.c                            |   89 ++-
>>  revision.h                            |   46 +-
>>  t/t6015-rev-cache-list.sh             |  228 ++++
>>  t/t6015-sha1-dump-diff.py             |   36 +
> 
> Hmpf.
> 
> We got rid of the last Python script in Git a long time ago, but now two 
> different patch series try to sneak that dependency (at least for testing) 
> back in.
> 
> That's all the worse because we cannot use Python in msysGit, and Windows 
> should be a platform benefitting dramatically from your work.

In fact, the test the script performs could be easily rephrased with
"sort", "uniq" and "comm".
OTOH: If the walker is supposed to return the exact same ordered list of
commits you can just use test_cmp.
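
The two kinds of check mentioned here, order-insensitive (the "sort"/"comm" style) versus exact sequence (what test_cmp would verify on two files), differ as in the following C sketch. The function names are hypothetical, and both lists are assumed to have the same length `n`.

```c
#include <stdlib.h>
#include <string.h>

/* qsort comparator for an array of C strings. */
static int cmp_str(const void *a, const void *b)
{
	return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Exact check: the same names in the same order. */
int same_sequence(const char **a, const char **b, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (strcmp(a[i], b[i]))
			return 0;
	return 1;
}

/* Order-insensitive check: sort copies of both lists, then compare. */
int same_members(const char **a, const char **b, size_t n)
{
	const char **sa, **sb;
	int ok = 0;

	if (!n)
		return 1;
	sa = malloc(n * sizeof(*sa));
	sb = malloc(n * sizeof(*sb));
	if (sa && sb) {
		memcpy(sa, a, n * sizeof(*sa));
		memcpy(sb, b, n * sizeof(*sb));
		qsort(sa, n, sizeof(*sa), cmp_str);
		qsort(sb, n, sizeof(*sb), cmp_str);
		ok = same_sequence(sa, sb, n);
	}
	free(sa);
	free(sb);
	return ok;
}
```

If the cached walker must reproduce the uncached order exactly, only `same_sequence` is the right check; `same_members` would also accept a reordered walk.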

Michael

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/5] Suggested for PU: revision caching system to significantly speed up packing/walking
  2009-08-06 14:58   ` Michael J Gruber
@ 2009-08-06 17:39     ` Nick Edelen
  2009-08-06 19:06       ` Johannes Schindelin
  0 siblings, 1 reply; 34+ messages in thread
From: Nick Edelen @ 2009-08-06 17:39 UTC (permalink / raw)
  To: Michael J Gruber
  Cc: Johannes Schindelin, Junio C Hamano, Jeff King, Sam Vilain,
	Shawn O. Pearce, Andreas Ericsson, Christian Couder,
	git@vger.kernel.org

Hi there,

> Sorry, I forgot the details, could you quickly remind me why these caches
> are not in the pack index files?

Er, I'm not sure what you mean.  Are you asking why these revision
caches are required if we have a pack index, or why they aren't in the
pack index, or something different?  I'm thinking probably the second:
the short answer is that cache slices are totally independent of pack
files.

It might be possible to somehow merge revision cache slices with pack
indexes, but I don't think it'd be a very suitable modification.  The
rev-cache slices are meant to act almost like topo-relation pack
files: new slices are simply new files, separate slice files can be
fused ("repacked") into a larger one, the index is a (recreatable)
single file associating file (positions) with objects.  The format was
geared to reducing potential cache/data loss and preventing overly
large cache slices.

>> Hmpf.
>>
>> We got rid of the last Python script in Git a long time ago, but now two
>> different patch series try to sneak that dependency (at least for testing)
>> back in.
>>
>> That's all the worse because we cannot use Python in msysGit, and Windows
>> should be a platform benefitting dramatically from your work.
>
> In fact, the test the script performs could be easily rephrased with
> "sort", "uniq" and "comm".
> OTOH: If the walker is supposed to return the exact same ordered list of
> commits you can just use test_cmp.

The language that script is written in isn't important.  I originally
wrote it in python because I wanted something quick and wasn't much of
a sh guru (sorry :-/ ).  As Michael said I've no doubt it can easily
be converted to shell script -- in fact, I'll try to get a shell
version working today.

 - Nick

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/5] Suggested for PU: revision caching system to significantly speed up packing/walking
  2009-08-06 17:39     ` Nick Edelen
@ 2009-08-06 19:06       ` Johannes Schindelin
  2009-08-06 20:01         ` Nick Edelen
                           ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Johannes Schindelin @ 2009-08-06 19:06 UTC (permalink / raw)
  To: Nick Edelen
  Cc: Michael J Gruber, Junio C Hamano, Jeff King, Sam Vilain,
	Shawn O. Pearce, Andreas Ericsson, Christian Couder,
	git@vger.kernel.org

Hi,

On Thu, 6 Aug 2009, Nick Edelen wrote:

> > Sorry, I forgot the details, could you quickly remind me why these 
> > caches are not in the pack index files?
> 
> Er, I'm not sure what you mean.  Are you asking why these revision 
> caches are required if we have a pack index, or why they aren't in the 
> pack index, or something different?  I'm thinking probably the second:

Yep.

> the short answer is that cache slices are totally independent of pack 
> files.

My idea with that was that you already have a SHA-1 map in the pack index, 
and if all you want is to be able to accelerate the revision walker, you'd 
probably need something that adds yet another mapping, from commit to 
parents and tree, and from tree to sub-tree and blob (so you can avoid 
unpacking commit and tree objects).
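
A sketch of the kind of extra mapping described here, in C. The record layout and every name in it are hypothetical and are not taken from the patch series or from the pack index format; the only borrowed idea is that entries are kept sorted by object name, as a pack index keeps its names, so that a lookup is a binary search.

```c
#include <stddef.h>
#include <string.h>

#define SHA1_LEN 20

/* Hypothetical record: what a walker needs from a commit without unpacking it. */
struct commit_map_entry {
	unsigned char commit[SHA1_LEN]; /* key: the commit's object name */
	unsigned char tree[SHA1_LEN];   /* root tree of that commit */
	unsigned int nr_parents;        /* number of parents */
	unsigned int first_parent;      /* index into a separate parents table */
};

/* Binary search over entries sorted by commit name; NULL if absent. */
const struct commit_map_entry *
lookup_commit_entry(const struct commit_map_entry *map, size_t nr,
		    const unsigned char *sha1)
{
	size_t lo = 0, hi = nr;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		int cmp = memcmp(sha1, map[mid].commit, SHA1_LEN);

		if (!cmp)
			return &map[mid];
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return NULL;
}
```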

I just thought that it could be more efficient to do it at the time the 
pack index is written _anyway_, as nothing will change in the pack after 
that anyway.

But I guess I can answer my question easily myself: the boundary commits 
will not be handled that way.

Still, there is some redundancy between the pack index and your cache, as 
you have to write out the whole list of SHA-1s all over again.  I guess it 
is time to look at the code instead of asking stupid questions.

> It might be possible to somehow merge revision cache slices with pack 
> indexes, but I don't think it'd be a very suitable modification.  The 
> rev-cache slices are meant to act almost like topo-relation pack files: 
> new slices are simply new files, separate slice files can be fused 
> ("repacked") into a larger one, the index is a (recreatable) single file 
> associating file (positions) with objects.  The format was geared to 
> reducing potential cache/data loss and preventing overly large cache 
> slices.
> 
> >> Hmpf.
> >>
> >> We got rid of the last Python script in Git a long time ago, but now 
> >> two different patch series try to sneak that dependency (at least for 
> >> testing) back in.
> >>
> >> That's all the worse because we cannot use Python in msysGit, and 
> >> Windows should be a platform benefitting dramatically from your work.
> >
> > In fact, the test the script performs could be easily rephrased with 
> > "sort", "uniq" and "comm". OTOH: If the walker is supposed to return 
> > the exact same ordered list of commits you can just use test_cmp.
> 
> The language that script is written in isn't important.  I originally
> wrote it in python because I wanted something quick and wasn't much of
> a sh guru (sorry :-/ ).  As Michael said I've no doubt it can easily
> be converted to shell script

That is not what I wanted to hear.

> -- in fact, I'll try to get a shell version working today.

That is.

Thanks,
Dscho

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/5] Suggested for PU: revision caching system to significantly speed up packing/walking
  2009-08-06 19:06       ` Johannes Schindelin
@ 2009-08-06 20:01         ` Nick Edelen
  2009-08-06 20:30           ` Nick Edelen
  2009-08-07  2:47         ` Sam Vilain
  2009-08-08 18:57         ` Junio C Hamano
  2 siblings, 1 reply; 34+ messages in thread
From: Nick Edelen @ 2009-08-06 20:01 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Michael J Gruber, Junio C Hamano, Jeff King, Sam Vilain,
	Shawn O. Pearce, Andreas Ericsson, Christian Couder,
	git@vger.kernel.org

Hi,

>My idea with that was that you already have a SHA-1 map in the pack index,
>and if all you want is to be able to accelerate the revision walker, you'd
>probably need something that adds yet another mapping, from commit to
>parents and tree, and from tree to sub-tree and blob (so you can avoid
>unpacking commit and tree objects).

As I mention in one of the patch descriptions, along with each commit
a list of objects introduced per commit is cached, so no extra I/O is
necessary for tree recursion, etc. during traversal.
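
To illustrate the idea of caching, with each commit, the objects it introduces: the sketch below uses an invented in-memory layout (the actual slice format is not shown in this thread) and a hypothetical `list_cached` helper. The point is only that the walker reads a flat list per commit instead of recursing into trees.

```c
#include <stddef.h>

#define SHA1_LEN 20

enum obj_kind { OBJ_KIND_TREE, OBJ_KIND_BLOB };

/* Hypothetical: one tree or blob first seen in a given commit. */
struct introduced_object {
	unsigned char sha1[SHA1_LEN];
	enum obj_kind kind;
};

/* Hypothetical: a cached commit plus the objects it introduced. */
struct cached_commit {
	unsigned char sha1[SHA1_LEN];
	size_t nr_objects;
	const struct introduced_object *objects;
};

/*
 * Emit each commit followed by its introduced objects, in cache order.
 * Stores at most out_cap names in out[] and returns the total number
 * of names the walk produced.
 */
size_t list_cached(const struct cached_commit *commits, size_t nr,
		   const unsigned char **out, size_t out_cap)
{
	size_t i, j, n = 0;

	for (i = 0; i < nr; i++) {
		if (n < out_cap)
			out[n] = commits[i].sha1;
		n++;
		for (j = 0; j < commits[i].nr_objects; j++) {
			if (n < out_cap)
				out[n] = commits[i].objects[j].sha1;
			n++;
		}
	}
	return n;
}
```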

>I just thought that it could be more efficient to do it at the time the
>pack index is written _anyway_, as nothing will change in the pack after
>that anyway.

Nothing might change in the pack, but the slices were made to allow
for continual addition and refinement of the cache.  In a typical
usage slices will be added and fused on a regular basis, which would
require tinkering in pack indexes if they were combined.

>But I guess I can answer my question easily myself: the boundary commits
>will not be handled that way.
>
>Still, there is some redundancy between the pack index and your cache, as
>you have to write out the whole list of SHA-1s all over again.  I guess it
>is time to look at the code instead of asking stupid questions.

The whole revision cache is redundant, technically speaking: nothing
in it can't be found by rummaging through packs or objects.  The point
of it was to distill out important information for fast, easy access
of the commit tree.

On another note, I've eliminated the python dependency.  Shall I
resend the patchset now or should I wait until it has been further
reviewed? (don't want to flood the list with resubmits)

 - Nick

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/5] Suggested for PU: revision caching system to significantly speed up packing/walking
  2009-08-06 20:01         ` Nick Edelen
@ 2009-08-06 20:30           ` Nick Edelen
  2009-08-06 20:32             ` Shawn O. Pearce
  2009-08-07  4:42             ` Nicolas Pitre
  0 siblings, 2 replies; 34+ messages in thread
From: Nick Edelen @ 2009-08-06 20:30 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Michael J Gruber, Junio C Hamano, Jeff King, Sam Vilain,
	Shawn O. Pearce, Andreas Ericsson, Christian Couder,
	git@vger.kernel.org

Hrmm, I just realized that it doesn't actually cache paths/names...
This obviously has no bearing on its use in packing, but I should
either add that in or restrict usage in non-packing-related walks.
Weird how things like that escape you.

I think I may go ahead and add support for this tomorrow.  It should
have no effect on performance and very little impact on cache slice
size.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/5] Suggested for PU: revision caching system to significantly speed up packing/walking
  2009-08-06 20:30           ` Nick Edelen
@ 2009-08-06 20:32             ` Shawn O. Pearce
  2009-08-06 23:35               ` A Large Angry SCM
  2009-08-07  4:42             ` Nicolas Pitre
  1 sibling, 1 reply; 34+ messages in thread
From: Shawn O. Pearce @ 2009-08-06 20:32 UTC (permalink / raw)
  To: Nick Edelen
  Cc: Johannes Schindelin, Michael J Gruber, Junio C Hamano, Jeff King,
	Sam Vilain, Andreas Ericsson, Christian Couder,
	git@vger.kernel.org

Nick Edelen <sirnot@gmail.com> wrote:
> Hrmm, I just realized that it doesn't actually cache paths/names...
> This obviously has no bearing on its use in packing, but I should
> either add that in or restrict usage in non-packing-related walks.
> Weird how things like that escape you.
> 
> I think I may go ahead and add support for this tomorrow.  It should
> have no effect on performance and very little impact on cache slice
> size.

You may not need the path name, but instead the hash value that
pack-objects computes from the path name.  All that matters is
the hash, so pack-objects can schedule the objects into the right
buckets when it's doing delta computation for objects which are not
yet delta compressed, or whose delta cannot be suitably reused.
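
A sketch of what such a path-derived value could look like, in C. It is written in the spirit of the name hash pack-objects uses to group similarly named files, but the function name `path_name_hash` is hypothetical and the exact function in git may differ. Later characters carry the most weight, so paths with the same ending produce nearby values.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit hash of a path, skipping whitespace; 0 for a missing name. */
uint32_t path_name_hash(const char *name)
{
	uint32_t hash = 0;
	unsigned char c;

	if (!name)
		return 0;
	while ((c = (unsigned char)*name++) != 0) {
		if (isspace(c))
			continue;
		/* Age earlier characters, put the newest in the top byte. */
		hash = (hash >> 2) + ((uint32_t)c << 24);
	}
	return hash;
}
```

Storing only these 32 bits per object, rather than the full path, is what keeps the extra cache data small.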

-- 
Shawn.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/5] Suggested for PU: revision caching system to significantly speed up packing/walking
  2009-08-06 20:32             ` Shawn O. Pearce
@ 2009-08-06 23:35               ` A Large Angry SCM
  2009-08-06 23:37                 ` Shawn O. Pearce
  0 siblings, 1 reply; 34+ messages in thread
From: A Large Angry SCM @ 2009-08-06 23:35 UTC (permalink / raw)
  To: Shawn O. Pearce
  Cc: Nick Edelen, Johannes Schindelin, Michael J Gruber,
	Junio C Hamano, Jeff King, Sam Vilain, Andreas Ericsson,
	Christian Couder, git@vger.kernel.org

Shawn O. Pearce wrote:
> Nick Edelen <sirnot@gmail.com> wrote:
>> Hrmm, I just realized that it doesn't actually cache paths/names...
>> This obviously has no bearing on its use in packing, but I should
>> either add that in or restrict usage in non-packing-related walks.
>> Weird how things like that escape you.
>>
>> I think I may go ahead and add support for this tomorrow.  It should
>> have no effect on performance and very little impact on cache slice
>> size.
> 
> You may not need the path name, but instead the hash value that
> pack-objects computes from the path name.  All that matters is
> the hash, so pack-objects can schedule the objects into the right
> buckets when it's doing delta computation for objects which are not
> yet delta compressed, or whose delta cannot be suitably reused.
> 

Please do NOT expose the hash values. The hash used by pack-objects is 
an implementation detail of the heuristics used by the _current_ object 
packing code. It would be a real shame to have to maintain backward 
compatibility with it at some future date after the packing machinery 
has changed.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/5] Suggested for PU: revision caching system to significantly speed up packing/walking
  2009-08-06 23:35               ` A Large Angry SCM
@ 2009-08-06 23:37                 ` Shawn O. Pearce
  2009-08-06 23:43                   ` A Large Angry SCM
  2009-08-07  6:05                   ` Johannes Schindelin
  0 siblings, 2 replies; 34+ messages in thread
From: Shawn O. Pearce @ 2009-08-06 23:37 UTC (permalink / raw)
  To: A Large Angry SCM
  Cc: Nick Edelen, Johannes Schindelin, Michael J Gruber,
	Junio C Hamano, Jeff King, Sam Vilain, Andreas Ericsson,
	Christian Couder, git@vger.kernel.org

A Large Angry SCM <gitzilla@gmail.com> wrote:
> Shawn O. Pearce wrote:
>> Nick Edelen <sirnot@gmail.com> wrote:
>>> Hrmm, I just realized that it doesn't actually cache paths/names...
>>
>> You may not need the path name, but instead the hash value that
>> pack-objects computes from the path name.
>
> Please do NOT expose the hash values. The hash used by pack-objects is  
> an implementation detail of the heuristics used by the _current_ object  
> packing code. It would be a real shame to have to maintain backward  
> compatibility with it at some future date after the packing machinery  
> has changed.

This is a local cache.  If there was a version number in the header,
and the hash function changes, we could just bump the version number
and invalidate all of the caches.
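
The version-bump scheme can be sketched as follows in C. The header layout, the "REVC" signature and the constant names are all assumptions made for illustration; the thread does not say what the real cache header contains.

```c
#include <stdint.h>
#include <string.h>

#define CACHE_SIGNATURE "REVC" /* hypothetical magic bytes */
#define CACHE_VERSION 1        /* bump whenever the stored hash function changes */

/* Hypothetical on-disk header at the start of a cache file. */
struct cache_header {
	char signature[4];
	uint8_t version;
};

/* Return 1 if the cache can be used as-is, 0 if it must be regenerated. */
int cache_is_usable(const struct cache_header *hdr)
{
	if (memcmp(hdr->signature, CACHE_SIGNATURE, 4))
		return 0;
	return hdr->version == CACHE_VERSION;
}
```

A reader that sees a mismatched version simply ignores the file and rebuilds it, so no backward compatibility with an old hash has to be maintained.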

No sense in storing (and doing IO of) huge duplicate string values
for something where we really only need 32 bits, and where a
recompute from scratch only costs a minute.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/5] Suggested for PU: revision caching system to significantly speed up packing/walking
  2009-08-06 23:37                 ` Shawn O. Pearce
@ 2009-08-06 23:43                   ` A Large Angry SCM
  2009-08-07  0:15                     ` Nick Edelen
  2009-08-07  6:05                   ` Johannes Schindelin
  1 sibling, 1 reply; 34+ messages in thread
From: A Large Angry SCM @ 2009-08-06 23:43 UTC (permalink / raw)
  To: Shawn O. Pearce
  Cc: Nick Edelen, Johannes Schindelin, Michael J Gruber,
	Junio C Hamano, Jeff King, Sam Vilain, Andreas Ericsson,
	Christian Couder, git@vger.kernel.org

Shawn O. Pearce wrote:
> A Large Angry SCM <gitzilla@gmail.com> wrote:
>> Shawn O. Pearce wrote:
>>> Nick Edelen <sirnot@gmail.com> wrote:
>>>> Hrmm, I just realized that it doesn't actually cache paths/names...
>>> You may not need the path name, but instead the hash value that
>>> pack-objects computes from the path name.
>> Please do NOT expose the hash values. The hash used by pack-objects is  
>> an implementation detail of the heuristics used by the _current_ object  
>> packing code. It would be a real shame to have to maintain backward  
>> compatibility with it at some future date after the packing machinery  
>> has changed.
> 
> This is a local cache.  If there was a version number in the header,
> and the hash function changes, we could just bump the version number
> and invalidate all of the caches.
> 
> No sense in storing (and doing IO of) huge duplicate string values
> for something where we really only need 32 bits, and where a
> recompute from scratch only costs a minute.
> 

That will work for me if the cache gets a version number and iff the 
pack-objects hash code gets big warning comments about the cache code 
dependency.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/5] Suggested for PU: revision caching system to significantly speed up packing/walking
  2009-08-06 23:43                   ` A Large Angry SCM
@ 2009-08-07  0:15                     ` Nick Edelen
  0 siblings, 0 replies; 34+ messages in thread
From: Nick Edelen @ 2009-08-07  0:15 UTC (permalink / raw)
  To: gitzilla
  Cc: Shawn O. Pearce, Johannes Schindelin, Michael J Gruber,
	Junio C Hamano, Jeff King, Sam Vilain, Andreas Ericsson,
	Christian Couder, git@vger.kernel.org

That would work, but I sorta like the idea of caching the actual
names.  I'm thinking of having a block of slice-unique, NUL-separated
names at the end of each slice (i.e. not in the mapping) which is
loaded into memory (it wouldn't be very big).  Then each blob/tree
object would have a variable-length index referencing a particular
name.

Using the actual names would give us greater flexibility, and allow
rev-cache to output proper rev-list type output (with the names after
the hashes).
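
A rough sketch of how such a name block could look, assuming a
LEB128-style varint for the "variable-length index" (the encoding and all
function names here are hypothetical, chosen only to illustrate the
idea, not taken from the patch series):

```python
def encode_varint(n):
    """Encode a non-negative int, 7 bits per byte, low bits first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf, pos=0):
    """Decode one varint from buf at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def build_name_table(names):
    """Deduplicate names; return (NUL-separated block, per-object indexes)."""
    unique = []
    index_of = {}
    refs = []
    for name in names:
        if name not in index_of:
            index_of[name] = len(unique)
            unique.append(name)
        refs.append(index_of[name])
    block = b"".join(n.encode("utf-8") + b"\0" for n in unique)
    return block, refs


def load_name_table(block):
    """Split the block back into names (drops the empty tail after the last NUL)."""
    return [n.decode("utf-8") for n in block.split(b"\0")[:-1]]
```

Each name is stored once per slice, and an object entry carries only a
one- or two-byte index in the common case.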

On Fri, Aug 7, 2009 at 1:43 AM, A Large Angry SCM<gitzilla@gmail.com> wrote:
> Shawn O. Pearce wrote:
>>
>> A Large Angry SCM <gitzilla@gmail.com> wrote:
>>>
>>> Shawn O. Pearce wrote:
>>>>
>>>> Nick Edelen <sirnot@gmail.com> wrote:
>>>>>
>>>>> Hrmm, I just realized that it doesn't actually cache paths/names...
>>>>
>>>> You may not need the path name, but instead the hash value that
>>>> pack-objects computes from the path name.
>>>
>>> Please do NOT expose the hash values. The hash used by pack-objects is
>>>  an implementation detail of the heuristics used by the _current_ object
>>>  packing code. It would be a real shame to have to maintain backward
>>>  compatibility with it at some future date after the packing machinery  has
>>> changed.
>>
>> This is a local cache.  If there was a version number in the header,
>> and the hash function changes, we could just bump the version number
>> and invalidate all of the caches.
>>
>> No sense in storing (and doing IO of) huge duplicate string values
>> for something where we really only need 32 bits, and where a
>> recompute from scratch only costs a minute.
>>
>
> That will work for me if the cache gets a version number and iff the
> pack-objects hash code gets big warning comments about the cache code
> dependency.
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/5] Suggested for PU: revision caching system to significantly speed up packing/walking
  2009-08-06 23:37                 ` Shawn O. Pearce
  2009-08-06 23:43                   ` A Large Angry SCM
@ 2009-08-07  6:05                   ` Johannes Schindelin
  1 sibling, 0 replies; 34+ messages in thread
From: Johannes Schindelin @ 2009-08-07  6:05 UTC (permalink / raw)
  To: Shawn O. Pearce
  Cc: A Large Angry SCM, Nick Edelen, Michael J Gruber, Junio C Hamano,
	Jeff King, Sam Vilain, Andreas Ericsson, Christian Couder,
	git@vger.kernel.org



On Thu, 6 Aug 2009, Shawn O. Pearce wrote:

> A Large Angry SCM <gitzilla@gmail.com> wrote:
> > Shawn O. Pearce wrote:
> >> Nick Edelen <sirnot@gmail.com> wrote:
> >>> Hrmm, I just realized that it doesn't actually cache paths/names...
> >>
> >> You may not need the path name, but instead the hash value that 
> >> pack-objects computes from the path name.
> >
> > Please do NOT expose the hash values. The hash used by pack-objects is 
> > an implementation detail of the heuristics used by the _current_ 
> > object packing code. It would be a real shame to have to maintain 
> > backward compatibility with it at some future date after the packing 
> > machinery has changed.
> 
> This is a local cache.  If there was a version number in the header, and 
> the hash function changes, we could just bump the version number and 
> invalidate all of the caches.
> 
> No sense in storing (and doing IO of) huge duplicate string values for 
> something where we really only need 32 bits, and where a recompute from 
> scratch only costs a minute.

FWIW it was this redundancy of duplicate (unpacked) strings that I 
meant, but I did a poor job of expressing myself, and consequently Nick 
did not understand what I wanted (and I'm on a slow connection, so I 
deleted the mail halfway through checking whether there were some real 
answers hidden in the huge quoted part instead of replying).

And the fragility of the dependency on the implementation detail of the 
pack index suggests to me that my intuition that this whole thing should 
be more tightly integrated with the pack index was not totally off the 
mark.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/5] Suggested for PU: revision caching system to significantly speed up packing/walking
  2009-08-06 20:30           ` Nick Edelen
  2009-08-06 20:32             ` Shawn O. Pearce
@ 2009-08-07  4:42             ` Nicolas Pitre
  1 sibling, 0 replies; 34+ messages in thread
From: Nicolas Pitre @ 2009-08-07  4:42 UTC (permalink / raw)
  To: Nick Edelen
  Cc: Johannes Schindelin, Michael J Gruber, Junio C Hamano, Jeff King,
	Sam Vilain, Shawn O. Pearce, Andreas Ericsson, Christian Couder,
	git@vger.kernel.org

On Thu, 6 Aug 2009, Nick Edelen wrote:

> Hrmm, I just realized that it doesn't actually cache paths/names...
> This obviously has no bearing on its use in packing, but I should
> either add that in or restrict usage in non-packing-related walks.
> Weird how things like that escape you.

Actually it is really the packing related walk that would benefit the 
most from this work.  The "counting objects" phase of a clone may take 
quite a while with some repositories.  Most other operations don't care 
as much because the rev walk is done incrementally whereas the packing 
operation needs to perform it all up front.

Nicolas

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/5] Suggested for PU: revision caching system to significantly speed up packing/walking
  2009-08-06 19:06       ` Johannes Schindelin
  2009-08-06 20:01         ` Nick Edelen
@ 2009-08-07  2:47         ` Sam Vilain
  2009-08-07  4:35           ` Nicolas Pitre
  2009-08-07  6:12           ` Johannes Schindelin
  2009-08-08 18:57         ` Junio C Hamano
  2 siblings, 2 replies; 34+ messages in thread
From: Sam Vilain @ 2009-08-07  2:47 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Nick Edelen, Michael J Gruber, Junio C Hamano, Jeff King,
	Shawn O. Pearce, Andreas Ericsson, Christian Couder,
	git@vger.kernel.org

Johannes Schindelin wrote:
>> the short answer is that cache slices are totally independent of pack 
>> files.
>>     
>
> My idea with that was that you already have a SHA-1 map in the pack index, 
> and if all you want to be able to accelerate the revision walker, you'd 
> probably need something that adds yet another mapping, from commit to 
> parents and tree, and from tree to sub-tree and blob (so you can avoid 
> unpacking commit and tree objects).
>   

Tying indexes together like that is not a good idea in the database
world. Especially in this case where, as Nick mentions, the domain is
subtly different (i.e. pack vs DAG). Unfortunately you just can't
pretend that they will always be the same; you can't force a full
repack on every ref change!

> Still, there is some redundancy between the pack index and your cache, as 
> you have to write out the whole list of SHA-1s all over again.  I guess it 
> is time to look at the code instead of asking stupid questions.
>   

"Disk is cheap" :-) It should be a welcome trade-off; perhaps it's worth
including numbers about how big the indexes are with the time
statistics. It sounds though like it should be a significant win as a
single index can be used to accelerate a wide range of rev-list
operations, and store indexes for many different questions that can be
asked.

Sam

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/5] Suggested for PU: revision caching system to significantly speed up packing/walking
  2009-08-07  2:47         ` Sam Vilain
@ 2009-08-07  4:35           ` Nicolas Pitre
  2009-08-07  6:08             ` Johannes Schindelin
  2009-08-07  6:12           ` Johannes Schindelin
  1 sibling, 1 reply; 34+ messages in thread
From: Nicolas Pitre @ 2009-08-07  4:35 UTC (permalink / raw)
  To: Sam Vilain
  Cc: Johannes Schindelin, Nick Edelen, Michael J Gruber,
	Junio C Hamano, Jeff King, Shawn O. Pearce, Andreas Ericsson,
	Christian Couder, git@vger.kernel.org

On Fri, 7 Aug 2009, Sam Vilain wrote:

> Johannes Schindelin wrote:
> >> the short answer is that cache slices are totally independent of pack 
> >> files.
> >>     
> >
> > My idea with that was that you already have a SHA-1 map in the pack index, 
> > and if all you want to be able to accelerate the revision walker, you'd 
> > probably need something that adds yet another mapping, from commit to 
> > parents and tree, and from tree to sub-tree and blob (so you can avoid 
> > unpacking commit and tree objects).
> >   
> 
> Tying indexes together like that is not a good idea in the database
> world. Especially as in this case as Nick mentions, the domain is subtly
> different (ie pack vs dag). Unfortunately you just can't try to pretend
> that they will always be the same; you can't force a full repack on
> every ref change!

Right.  And the rev cache must work even if the repository is not 
packed. So pack index and rev caching are orthogonal things and are best 
kept separate on disk.

How big this cache might get would be interesting indeed.


Nicolas

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/5] Suggested for PU: revision caching system to significantly speed up packing/walking
  2009-08-07  4:35           ` Nicolas Pitre
@ 2009-08-07  6:08             ` Johannes Schindelin
  2009-08-07 14:18               ` Nicolas Pitre
  0 siblings, 1 reply; 34+ messages in thread
From: Johannes Schindelin @ 2009-08-07  6:08 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Sam Vilain, Nick Edelen, Michael J Gruber, Junio C Hamano,
	Jeff King, Shawn O. Pearce, Andreas Ericsson, Christian Couder,
	git@vger.kernel.org

Hi,

On Fri, 7 Aug 2009, Nicolas Pitre wrote:

> On Fri, 7 Aug 2009, Sam Vilain wrote:
> 
> > Johannes Schindelin wrote:
> > >> the short answer is that cache slices are totally independent of 
> > >> pack files.
> > >>     
> > >
> > > My idea with that was that you already have a SHA-1 map in the pack 
> > > index, and if all you want to be able to accelerate the revision 
> > > walker, you'd probably need something that adds yet another mapping, 
> > > from commit to parents and tree, and from tree to sub-tree and blob 
> > > (so you can avoid unpacking commit and tree objects).
> > >   
> > 
> > Tying indexes together like that is not a good idea in the database 
> > world. Especially as in this case as Nick mentions, the domain is 
> > subtly different (ie pack vs dag). Unfortunately you just can't try to 
> > pretend that they will always be the same; you can't force a full 
> > repack on every ref change!
> 
> Right.  And the rev cache must work even if the repository is not 
> packed.

Umm, why?  AFAICT the principal purpose of the rev cache is to help work 
loads on, say, www.kernel.org.

I am unlikely to notice the improvements in my regular "git log" calls 
that only show a couple of pages before I quit the pager.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/5] Suggested for PU: revision caching system to significantly speed up packing/walking
  2009-08-07  6:08             ` Johannes Schindelin
@ 2009-08-07 14:18               ` Nicolas Pitre
  2009-08-08 15:18                 ` Johannes Schindelin
  0 siblings, 1 reply; 34+ messages in thread
From: Nicolas Pitre @ 2009-08-07 14:18 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Sam Vilain, Nick Edelen, Michael J Gruber, Junio C Hamano,
	Jeff King, Shawn O. Pearce, Andreas Ericsson, Christian Couder,
	git@vger.kernel.org

On Fri, 7 Aug 2009, Johannes Schindelin wrote:

> Hi,
> 
> On Fri, 7 Aug 2009, Nicolas Pitre wrote:
> 
> > On Fri, 7 Aug 2009, Sam Vilain wrote:
> > 
> > > Johannes Schindelin wrote:
> > > >> the short answer is that cache slices are totally independent of 
> > > >> pack files.
> > > >>     
> > > >
> > > > My idea with that was that you already have a SHA-1 map in the pack 
> > > > index, and if all you want to be able to accelerate the revision 
> > > > walker, you'd probably need something that adds yet another mapping, 
> > > > from commit to parents and tree, and from tree to sub-tree and blob 
> > > > (so you can avoid unpacking commit and tree objects).
> > > >   
> > > 
> > > Tying indexes together like that is not a good idea in the database 
> > > world. Especially as in this case as Nick mentions, the domain is 
> > > subtly different (ie pack vs dag). Unfortunately you just can't try to 
> > > pretend that they will always be the same; you can't force a full 
> > > repack on every ref change!
> > 
> > Right.  And the rev cache must work even if the repository is not 
> > packed.
> 
> Umm, why?  AFAICT the principal purpose of the rev cache is to help work 
> loads on, say, www.kernel.org.

So what?

Speeding up rev-list with a rev cache is completely orthogonal to 
whether the repository is packed or not.  It is like having a "git diff" 
result cache: no one would think of stuffing that in the pack index.

If we want to improve on the repository packing format, that must be 
doable without bothering with an independent concept such as a rev 
cache.

> I am unlikely to notice the improvements in my regular "git log" calls 
> that only show a couple of pages before I quit the pager.

Indeed.  But what is your point again?


Nicolas

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/5] Suggested for PU: revision caching system to significantly speed up packing/walking
  2009-08-07 14:18               ` Nicolas Pitre
@ 2009-08-08 15:18                 ` Johannes Schindelin
  2009-08-08 16:07                   ` Junio C Hamano
                                     ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Johannes Schindelin @ 2009-08-08 15:18 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Sam Vilain, Nick Edelen, Michael J Gruber, Junio C Hamano,
	Jeff King, Shawn O. Pearce, Andreas Ericsson, Christian Couder,
	git@vger.kernel.org

Hi,

On Fri, 7 Aug 2009, Nicolas Pitre wrote:

> On Fri, 7 Aug 2009, Johannes Schindelin wrote:
> 
> > Hi,
> > 
> > On Fri, 7 Aug 2009, Nicolas Pitre wrote:
> > 
> > > On Fri, 7 Aug 2009, Sam Vilain wrote:
> > > 
> > > > Johannes Schindelin wrote:
> > > > >> the short answer is that cache slices are totally independent of 
> > > > >> pack files.
> > > > >>     
> > > > >
> > > > > My idea with that was that you already have a SHA-1 map in the pack 
> > > > > index, and if all you want to be able to accelerate the revision 
> > > > > walker, you'd probably need something that adds yet another mapping, 
> > > > > from commit to parents and tree, and from tree to sub-tree and blob 
> > > > > (so you can avoid unpacking commit and tree objects).
> > > > >   
> > > > 
> > > > Tying indexes together like that is not a good idea in the database 
> > > > world. Especially as in this case as Nick mentions, the domain is 
> > > > subtly different (ie pack vs dag). Unfortunately you just can't try to 
> > > > pretend that they will always be the same; you can't force a full 
> > > > repack on every ref change!
> > > 
> > > Right.  And the rev cache must work even if the repository is not 
> > > packed.
> > 
> > Umm, why?  AFAICT the principal purpose of the rev cache is to help work 
> > loads on, say, www.kernel.org.
> 
> So what?
> 
> Speeding up rev-list with a rev cache is completely orthogonal to 
> whether the repository is packed or not.

No, it is not.

For both technical and practical reasons, caching revision walker data is
very closely related to packing.

You are _very_ unlikely helped by speeding up revision walking in the 
general case, _especially_ when you do stuff like blame or -S that needs 
to unpack tons of objects _anyway_.

The one big kicker argument for speeding up revision walking _is_ to 
relieve the loads on big ass servers, and they _should_ be as packed as 
possible (as I will patiently explain over and over again).

> It is like having a "git diff" result cache: no one would think of 
> stuffing that in the pack index.

Do you want to try to kid me?  You'll have to try harder.  Caching "git 
diff" results... no, really!

> If we want to improve on the repository packing format, that must be 
> doable without bothering with an independent concept such as a rev 
> cache.

I would love to tell you that you're right, but the single fact that 
pack v4 is starting to compete with Duke Nukem Forever just prevents me 
from doing that.

> > I am unlikely to notice the improvements in my regular "git log" calls 
> > that only show a couple of pages before I quit the pager.
> 
> Indeed.  But what is your point again?

Oh?  My point?  Being that the rev cache has a certain target audience, 
and that the regular user is not part of that audience, and that it just 
so happens that the _technical_ similarities with the pack index can be 
exploited in those scenarios?

IOW we can be pretty certain that a heavy-load server has a fully (or 
next-to-fully) packed object database.  The pack indices already contain a 
SHA-1 table that we can simply reuse.  And it should not be hard (or 
fragile) at all to put the "cached" information about parents, 
referenced tree and blob objects into that file, into a different section.

After all, the parents, referenced tree and blob objects change as 
often as the objects in the pack: never.
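
To make concrete what such a cached section would buy, here is a small
sketch under the assumption that it boils down to a table from commit to
(tree, parents).  The table shape and the function name are hypothetical;
the point is only that a walk over this table never has to inflate a
commit object:

```python
from collections import deque


def walk_commits(tips, commit_map):
    """Return commits reachable from tips, breadth-first.

    commit_map is the assumed cached table: {commit: (tree, [parents])}.
    Only this table is consulted; no object is unpacked.
    """
    seen = set()
    queue = deque(tips)
    order = []
    while queue:
        commit = queue.popleft()
        if commit in seen:
            continue
        seen.add(commit)
        order.append(commit)
        _tree, parents = commit_map[commit]
        queue.extend(parents)
    return order
```

Since a commit's tree and parents are fixed by its SHA-1, entries in such
a table never need to be rewritten, only added.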

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/5] Suggested for PU: revision caching system to significantly speed up packing/walking
  2009-08-08 15:18                 ` Johannes Schindelin
@ 2009-08-08 16:07                   ` Junio C Hamano
  2009-08-08 23:54                   ` Sam Vilain
  2009-08-09  2:37                   ` Nicolas Pitre
  2 siblings, 0 replies; 34+ messages in thread
From: Junio C Hamano @ 2009-08-08 16:07 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Nicolas Pitre, Sam Vilain, Nick Edelen, Michael J Gruber,
	Jeff King, Shawn O. Pearce, Andreas Ericsson, Christian Couder,
	git@vger.kernel.org

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

>> It is like having a "git diff" result cache: no one would think of 
>> stuffing that in the pack index.
>
> Do you want to try to kid me?  You'll have to try harder.  Caching "git 
> diff" results... no, really!

I thought Nico meant caching the rename similarity matrix when he said
"git diff" cache.  In a narrow context of "log --follow" it may make
sense.  It also might help "blame" without -C.

But I do not see how we would gain anything if we had that cache tied to
the pack index.

> After all, the parents, referenced tree and blob objects change as 
> often as the objects in the pack: never.

But look at the aspects of objects you listed: the parents, referenced
tree and blob objects.  How often they change does not depend on where
the actual object is, either packed or loose.  These aspects of an
object (more specifically, you are talking about a commit object) never
change either way.  So I am somewhat puzzled.

What does change, as refs change (especially when they rewind), is
whether a particular commit and its associated objects are relevant to
the traversal.  Just like an old "kept" pack can suddenly have tons of
irrelevant objects after refs are pruned (e.g. a branch is dropped),
cached reachability data, even though it may stay correct, would become
irrelevant when nobody starts traversing from an object that is no
longer reachable.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/5] Suggested for PU: revision caching system to significantly speed up packing/walking
  2009-08-08 15:18                 ` Johannes Schindelin
  2009-08-08 16:07                   ` Junio C Hamano
@ 2009-08-08 23:54                   ` Sam Vilain
  2009-08-09  2:37                   ` Nicolas Pitre
  2 siblings, 0 replies; 34+ messages in thread
From: Sam Vilain @ 2009-08-08 23:54 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Nicolas Pitre, Nick Edelen, Michael J Gruber, Junio C Hamano,
	Jeff King, Shawn O. Pearce, Andreas Ericsson, Christian Couder,
	git@vger.kernel.org

On Sat, 2009-08-08 at 17:18 +0200, Johannes Schindelin wrote:
> > Speeding up rev-list with a rev cache is completely orthogonal to 
> > whether the repository is packed or not.
> 
> No, it is not.
> 
> For both technical and practical reasons, caching revision walker data is
> very closely related to packing.
> [...]
> ... the rev cache has a certain target audience, 
> and that the regular user is not part of that audience, and that it just 
> so happens that the _technical_ similarities with the pack index can be 
> exploited in those scenarios?
> 
> IOW we can be pretty certain that a heavy-load server has a fully (or 
> next-to-fully) packed object database.  The pack indices already contain a 
> SHA-1 table that we can simply reuse.  And it should not be hard (or 
> fragile) at all to put the "cached" information about parents, 
> referenced tree and blob objects into that file, into a different section.

I think your argument would work better if packs and bundles were the
same thing, and we always stored bundles in the objects/packs directory,
but they're not and we don't.  You can't assume that a pack has any
particular properties, such as representing the objects returned from a
single rev-cache walk.  And I will say that *especially* on a busy git
server, serving active projects you can't expect people to repack their
repository for every single update.  Repacking daily or so by a batch
job, sure.  Expecting the repository to always be fully packed?  No.
Too much churn, or inefficient packing.  You can't just pretend that the
mixed packed/loose case doesn't exist.

The 10% size seems a very good bang for your buck to me and a good
start.

Sam

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/5] Suggested for PU: revision caching system to significantly speed up packing/walking
  2009-08-08 15:18                 ` Johannes Schindelin
  2009-08-08 16:07                   ` Junio C Hamano
  2009-08-08 23:54                   ` Sam Vilain
@ 2009-08-09  2:37                   ` Nicolas Pitre
  2009-08-09 13:42                     ` Nick Edelen
  2 siblings, 1 reply; 34+ messages in thread
From: Nicolas Pitre @ 2009-08-09  2:37 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Sam Vilain, Nick Edelen, Michael J Gruber, Junio C Hamano,
	Jeff King, Shawn O. Pearce, Andreas Ericsson, Christian Couder,
	git@vger.kernel.org

On Sat, 8 Aug 2009, Johannes Schindelin wrote:

> Hi,
> 
> On Fri, 7 Aug 2009, Nicolas Pitre wrote:
> 
> > On Fri, 7 Aug 2009, Johannes Schindelin wrote:
> > 
> > > Hi,
> > > 
> > > On Fri, 7 Aug 2009, Nicolas Pitre wrote:
> > > 
> > > > On Fri, 7 Aug 2009, Sam Vilain wrote:
> > > > 
> > > > > Johannes Schindelin wrote:
> > > > > >> the short answer is that cache slices are totally independent of 
> > > > > >> pack files.
> > > > > >>     
> > > > > >
> > > > > > My idea with that was that you already have a SHA-1 map in the pack 
> > > > > > index, and if all you want to be able to accelerate the revision 
> > > > > > walker, you'd probably need something that adds yet another mapping, 
> > > > > > from commit to parents and tree, and from tree to sub-tree and blob 
> > > > > > (so you can avoid unpacking commit and tree objects).
> > > > > >   
> > > > > 
> > > > > Tying indexes together like that is not a good idea in the database 
> > > > > world. Especially as in this case as Nick mentions, the domain is 
> > > > > subtly different (ie pack vs dag). Unfortunately you just can't try to 
> > > > > pretend that they will always be the same; you can't force a full 
> > > > > repack on every ref change!
> > > > 
> > > > Right.  And the rev cache must work even if the repository is not 
> > > > packed.
> > > 
> > > Umm, why?  AFAICT the principal purpose of the rev cache is to help work 
> > > loads on, say, www.kernel.org.
> > 
> > So what?
> > 
> > Speeding up rev-list with a rev cache is completely orthogonal to 
> > whether the repository is packed or not.
> 
> No, it is not.
> 
> For both technical and practical reasons, caching revision walker data is
> very closely related to packing.

No it is not.

> You are _very_ unlikely helped by speeding up revision walking in the 
> general case, _especially_ when you do stuff like blame or -S that needs 
> to unpack tons of objects _anyway_.

I completely agree.  And this is why I wish for the rev cache to be 
enabled with a config variable.  Why?  Because "client" repositories are 
unlikely to benefit as much as "server" repositories from this cache.

> The one big kicker argument for speeding up revision walking _is_ to 
> relieve the loads on big ass servers, and they _should_ be as packed as 
> possible (as I will patiently explain over and over again).

Please do a shortlog on builtin-pack-objects.c and realize who spent a 
lot of energy making git repository packing what it is now.

Then re-read what I said on this list for more than a year now about the 
remaining latency on large git clone operations.  After explaining it a 
couple times already I will need some of your patience to make people 
understand that fully packing your repo is _not_ going to help it as 
we're always talking about fully packed repos to start with.

> > It is like having a "git diff" result cache: no one would think of 
> > stuffing that in the pack index.
> 
> Do you want to try to kid me?  You'll have to try harder.  Caching "git 
> diff" results... no, really!

Why not?  You worked on git diff yourself, so you certainly are well 
positioned to appreciate my suggestion, no?

> > If we want to improve on the repository packing format, that must be 
> > doable without bothering with an independent concept such as a rev 
> > cache.
> 
> I would love to tell you that you're right, but the single fact that 
> pack v4 is starting to compete with Duke Nukem Forever just prevents me 
> from doing that.

Because of the current economy, I was waiting to be laid off so as to have 
the time to make pack v4 a reality.  Unfortunately for git they didn't 
fire me yet.  Oh well...

> > > I am unlikely to notice the improvements in my regular "git log" calls 
> > > that only show a couple of pages before I quit the pager.
> > 
> > Indeed.  But what is your point again?
> 
> Oh?  My point?  Being that the rev cache has a certain target audience, 
> and that the regular user is not part of that audience, and that it just 
> so happens that the _technical_ similarities with the pack index can be 
> exploited in those scenarios?

I don't see a similarity with the pack index at all, certainly not a 
technical one.

> IOW we can be pretty certain that a heavy-load server has a fully (or 
> next-to-fully) packed object database.  The pack indices already contain a 
> SHA-1 table that we can simply reuse.  And it should not be hard (or 
> fragile) at all to put the "cached" information about parents, 
> referenced tree and blob objects into that file, into a different section.

And then someone does a few pushes.  So most of the time your repository 
contains a few packs and not only a single one.  So in which pack index 
files should you put the rev cache?  What do you do with that cache if it 
happens to be split across multiple pack index files when a repack is 
performed?  Can't you see all the disadvantages of tying rev cache data, 
which happens to share no issue with repacking, into the same file?

So what do you do to keep the code simple and maintainable?  Yes, you 
abstract things and use separate files, thank you very much.  After all, 
the packed refs are not stored in the pack index file either even if the 
packed-refs file contains a list of SHA1's.

Nicolas

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/5] Suggested for PU: revision caching system to significantly speed up packing/walking
  2009-08-09  2:37                   ` Nicolas Pitre
@ 2009-08-09 13:42                     ` Nick Edelen
  0 siblings, 0 replies; 34+ messages in thread
From: Nick Edelen @ 2009-08-09 13:42 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Johannes Schindelin, Sam Vilain, Michael J Gruber, Junio C Hamano,
	Jeff King, Shawn O. Pearce, Andreas Ericsson, Christian Couder,
	git@vger.kernel.org

I can see the logic behind Johannes's ideas, but I'm still not sure
it'd be a great modification.  If you wanted to associate revision
caching much more strongly with packs, then packs and slices could be
merged reasonably well.  Say you just attached the actual slice data
at the end of the pack, then stored offsets of the slice payload in
the pack index.  Since you'd (presumably) have to search the index for
the object anyway, you wouldn't have to deal with searching a
rev-cache index on top of that (although it's not exactly unoptimized
now).
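
As a sketch of that hypothetical merged layout (this is not the real
pack .idx format; the tuple shape and names are made up for
illustration), each index entry would carry a slice offset alongside the
usual pack offset, so a single binary search yields both:

```python
import bisect


def build_index(entries):
    """entries: iterable of (oid_hex, pack_offset, slice_offset).

    slice_offset is None for objects that have no cached slice data.
    Returns (sorted keys, sorted rows) for binary searching.
    """
    rows = sorted(entries, key=lambda e: e[0])
    keys = [row[0] for row in rows]
    return keys, rows


def lookup(index, oid):
    """One search returns both offsets, or None if the object is absent."""
    keys, rows = index
    i = bisect.bisect_left(keys, oid)
    if i < len(keys) and keys[i] == oid:
        _, pack_offset, slice_offset = rows[i]
        return pack_offset, slice_offset
    return None
```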

However, that would sorta be preemptively limiting rev-cache to
pack-related optimizations.  I mean at the moment that's the main
target, but it could be improved in the future to be more relevant to
other operations as well.  Leaving the rev-cache as a separate system
would keep both it and packing much more flexible, and open to
longer-term developments.

>I haven't read the side of the patch that _uses_ the information stored in
>the rev-cache to figure out what it optimizes and what its limitations are
>(e.g. how it interacts with pathspecs).  Perhaps the rev-cache may turn
>out to be _only_ useful for pack-objects and nothing else, in which case
>we may not care about standalone version of rev-cache generator after all.

rev-cache's cache slice traversal basically emulates git's revision
walker, on a smaller scale.  At the moment it only really handles date
limiting (and obviously slop stuff) so it's not used for any pruning.
That's not to say it couldn't be updated in the future though.
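
For illustration, a much simplified sketch of date-limited walking over
cached entries.  It assumes an in-memory table {commit: (date, parents)}
and deliberately ignores the slop handling mentioned above, so it is not
how the revision walker or rev-cache actually does it:

```python
import heapq


def walk_since(tips, commits, since):
    """Return commits reachable from tips with date >= since, newest first.

    commits is an assumed cached table: {oid: (date, [parents])}.
    A commit older than the cutoff is neither emitted nor followed, which
    is only safe when dates never increase from child to parent.
    """
    seen = set(tips)
    heap = [(-commits[oid][0], oid) for oid in seen]
    heapq.heapify(heap)
    out = []
    while heap:
        neg_date, oid = heapq.heappop(heap)
        if -neg_date < since:
            continue
        out.append(oid)
        for parent in commits[oid][1]:
            if parent not in seen:
                seen.add(parent)
                heapq.heappush(heap, (-commits[parent][0], parent))
    return out
```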

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/5] Suggested for PU: revision caching system to significantly speed up packing/walking
  2009-08-07  2:47         ` Sam Vilain
  2009-08-07  4:35           ` Nicolas Pitre
@ 2009-08-07  6:12           ` Johannes Schindelin
  2009-08-07 15:00             ` Nicolas Pitre
  1 sibling, 1 reply; 34+ messages in thread
From: Johannes Schindelin @ 2009-08-07  6:12 UTC (permalink / raw)
  To: Sam Vilain
  Cc: Nick Edelen, Michael J Gruber, Junio C Hamano, Jeff King,
	Shawn O. Pearce, Andreas Ericsson, Christian Couder,
	git@vger.kernel.org

Hi,

On Fri, 7 Aug 2009, Sam Vilain wrote:

> Johannes Schindelin wrote:
> >> the short answer is that cache slices are totally independent of pack 
> >> files.
> >>     
> >
> > My idea with that was that you already have a SHA-1 map in the pack 
> > index, and if all you want to be able to accelerate the revision 
> > walker, you'd probably need something that adds yet another mapping, 
> > from commit to parents and tree, and from tree to sub-tree and blob 
> > (so you can avoid unpacking commit and tree objects).
> >   
> 
> Tying indexes together like that is not a good idea in the database
> world.

This is not the same index as in the database world.  It is more 
comparable with a cached view.  And there, it is generally a good idea to 
keep related things in the same cached view (with an outer join), rather 
than having two primary keys for almost every record.

> Especially as in this case as Nick mentions, the domain is subtly 
> different (ie pack vs dag). Unfortunately you just can't try to pretend 
> that they will always be the same; you can't force a full repack on 
> every ref change!

No, but you do not need that, either.  In the setting that is most likely 
the most thankful one, i.e. a git:// server, you _want_ to keep the 
repository "as packed as possible", otherwise the rev cache improvements 
will be lost in the bad packing performance anyway.

> > Still, there is some redundancy between the pack index and your cache, 
> > as you have to write out the whole list of SHA-1s all over again.  I 
> > guess it is time to look at the code instead of asking stupid 
> > questions.
> >   
> 
> "Disk is cheap" :-)

Disk I/O ain't.

(Size of the I/O caches, yaddayadda, I'm sure you get my point).

Ciao,
Dscho


* Re: [PATCH 0/5] Suggested for PU: revision caching system to significantly speed up packing/walking
  2009-08-07  6:12           ` Johannes Schindelin
@ 2009-08-07 15:00             ` Nicolas Pitre
  2009-08-07 22:02               ` Nick Edelen
  0 siblings, 1 reply; 34+ messages in thread
From: Nicolas Pitre @ 2009-08-07 15:00 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Sam Vilain, Nick Edelen, Michael J Gruber, Junio C Hamano,
	Jeff King, Shawn O. Pearce, Andreas Ericsson, Christian Couder,
	git@vger.kernel.org

On Fri, 7 Aug 2009, Johannes Schindelin wrote:

> On Fri, 7 Aug 2009, Sam Vilain wrote:
> 
> > Especially as in this case as Nick mentions, the domain is subtly 
> > different (ie pack vs dag). Unfortunately you just can't try to pretend 
> > that they will always be the same; you can't force a full repack on 
> > every ref change!
> 
> No, but you do not need that, either.  In the setting that is most likely 
> the most thankful one, i.e. a git:// server, you _want_ to keep the 
> repository "as packed as possible", otherwise the rev cache improvements 
> will be lost in the bad packing performance anyway.

Yes and no.

Currently, the #1 latency in any initial git clone is the famous 
"counting objects" phase, even if the repo is perfectly packed.  And 
that's all this rev cache can and will improve.  The packing does play 
its performance role of course, but for a totally different reason.  
Hence the repository need not be perfectly packed for a rev cache to 
speed up its own part of the game.

> > > Still, there is some redundancy between the pack index and your cache, 
> > > as you have to write out the whole list of SHA-1s all over again.  I 
> > > guess it is time to look at the code instead of asking stupid 
> > > questions.
> > >   
> > 
> > "Disk is cheap" :-)
> 
> Disk I/O ain't.
> 
> (Size of the I/O caches, yaddayadda, I'm sure you get my point).

I don't know about the size of the rev cache on disk yet (I asked Nick 
about that) nor do I really know how this cache is implemented.  But I 
know damn well about git packs and associated index and I for sure don't 
want to see a revision cache coupled with it.

And for a clone the disk IO will certainly be an order of magnitude larger 
than for the cache (or so I hope).  Maybe the IO for the rev cache might be a 
significant overhead for operations performed on a client (aka 
developer) repository, in which case it would be a good idea to have a 
config variable to control the cache size, or even to turn it off 
entirely.  We do it for delta depth and many other things already.

Nicolas


* Re: [PATCH 0/5] Suggested for PU: revision caching system to  significantly speed up packing/walking
  2009-08-07 15:00             ` Nicolas Pitre
@ 2009-08-07 22:02               ` Nick Edelen
  2009-08-07 22:48                 ` Junio C Hamano
  0 siblings, 1 reply; 34+ messages in thread
From: Nick Edelen @ 2009-08-07 22:02 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Johannes Schindelin, Sam Vilain, Michael J Gruber, Junio C Hamano,
	Jeff King, Shawn O. Pearce, Andreas Ericsson, Christian Couder,
	git@vger.kernel.org

> I don't know about the size of the rev cache on disk yet (I asked Nick
> about that) nor do I really know how this cache is implemented.

The cache file for all of the linux repository (as of a few weeks ago)
is around 42MB, without names.  The names would probably add 2 or 3 MB
on top of that.  That's probably about as big as I'd want to get, as
the whole slice (minus the name list) is mapped to memory (then again
bigger might be ok; I'm not an expert on mem mapping).  The rev-cache
command fuse provides functionality to ignore certain cache sizes,
which was geared towards preventing overly large slices.
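As a rough illustration of the two points above (mapping a slice into memory, and skipping slices beyond some size), here is a small sketch. The helper names, file names and the 1024-byte threshold are hypothetical and are not rev-cache's actual interface; the files are dummies created just for the demo.

```python
import mmap
import os
import tempfile


def slices_small_enough(paths, max_bytes):
    """Return only the paths whose file size is at most max_bytes."""
    return [p for p in paths if os.path.getsize(p) <= max_bytes]


def read_mapped(path, start, length):
    """Map the whole file read-only and copy out one byte range."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            return bytes(view[start:start + length])


# Two dummy "slices" of 64 and 4096 bytes in a scratch directory.
demo_dir = tempfile.mkdtemp()
small_path = os.path.join(demo_dir, "small.slice")
large_path = os.path.join(demo_dir, "large.slice")
with open(small_path, "wb") as f:
    f.write(b"HEAD" + b"\0" * 60)
with open(large_path, "wb") as f:
    f.write(b"HEAD" + b"\0" * 4092)

kept = slices_small_enough([small_path, large_path], max_bytes=1024)
header = read_mapped(small_path, 0, 4)
```

With a mapping the kernel pages data in on demand, so the mapped size is an upper bound on memory touched rather than an up-front cost.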


* Re: [PATCH 0/5] Suggested for PU: revision caching system to  significantly speed up packing/walking
  2009-08-07 22:02               ` Nick Edelen
@ 2009-08-07 22:48                 ` Junio C Hamano
  2009-08-07 22:53                   ` Nick Edelen
  2009-08-08  2:50                   ` Jeff King
  0 siblings, 2 replies; 34+ messages in thread
From: Junio C Hamano @ 2009-08-07 22:48 UTC (permalink / raw)
  To: Nick Edelen
  Cc: Nicolas Pitre, Johannes Schindelin, Sam Vilain, Michael J Gruber,
	Junio C Hamano, Jeff King, Shawn O. Pearce, Andreas Ericsson,
	Christian Couder, git@vger.kernel.org

Nick Edelen <sirnot@gmail.com> writes:

> The cache file for all of the linux repository (as of a few weeks ago)
> is around 42MB, without names.  The names would probably add 2 or 3 MB
> on top of that.  That's probably about as big as I'd want to get,

Hmm.  .git/objects/ as of today is about 482M here, so we are talking
about roughly 10% overhead?


* Re: [PATCH 0/5] Suggested for PU: revision caching system to  significantly speed up packing/walking
  2009-08-07 22:48                 ` Junio C Hamano
@ 2009-08-07 22:53                   ` Nick Edelen
  2009-08-08  3:11                     ` Junio C Hamano
  2009-08-08  2:50                   ` Jeff King
  1 sibling, 1 reply; 34+ messages in thread
From: Nick Edelen @ 2009-08-07 22:53 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Nicolas Pitre, Johannes Schindelin, Sam Vilain, Michael J Gruber,
	Jeff King, Shawn O. Pearce, Andreas Ericsson, Christian Couder,
	git@vger.kernel.org

> Hmm.  .git/objects/ as of today is about 482M here, so we are talking
> about roughly 10% overhead?

Yes, that sounds about right.  The cache file for git's repository is
3MB, and my repo (partly packed) is ~35MB.
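Spelling out the arithmetic with the figures quoted in this thread (the 3 MB for names is the upper end of the "2 or 3 MB" estimate; `overhead_percent` is just a helper for this note):

```python
def overhead_percent(cache_mb, objects_mb):
    """Cache size as a percentage of the object store size."""
    return 100.0 * cache_mb / objects_mb


# linux: 42 MB cache vs. roughly 482 MB in .git/objects
linux = overhead_percent(42, 482)
# same, with an assumed 3 MB of names added on top
linux_with_names = overhead_percent(42 + 3, 482)
# git: 3 MB cache vs. a partly packed repo of about 35 MB
git_repo = overhead_percent(3, 35)

print(f"linux: {linux:.1f}%  with names: {linux_with_names:.1f}%  "
      f"git: {git_repo:.1f}%")
```

All three land a little under 10%, so "roughly 10%" is a fair round figure.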

By the way, what would be the best way of posting a revised patchset?
Should I just reply to my older posts, or make new ones?


* Re: [PATCH 0/5] Suggested for PU: revision caching system to  significantly speed up packing/walking
  2009-08-07 22:53                   ` Nick Edelen
@ 2009-08-08  3:11                     ` Junio C Hamano
  2009-08-08  7:27                       ` Nick Edelen
  0 siblings, 1 reply; 34+ messages in thread
From: Junio C Hamano @ 2009-08-08  3:11 UTC (permalink / raw)
  To: Nick Edelen
  Cc: Nicolas Pitre, Johannes Schindelin, Sam Vilain, Michael J Gruber,
	Jeff King, Shawn O. Pearce, Andreas Ericsson, Christian Couder,
	git@vger.kernel.org

Nick Edelen <sirnot@gmail.com> writes:

> By the way, what would be the best way of posting a revised patchset?
> Should I just reply to my older posts, or make new ones?

That depends primarily on how heavily the patches needed to change in
response to review comments, but until the series lands in 'next', you
would typically send updated series as a replacement, not incremental.

Many people seemed to be interested in the series and had a volume of
comments on it.  I suspect the updated series would be quite different
from the original, so for the next round I would suspect it would be best
to start anew, marking them as [PATCH N/M (v2)], in a fresh thread.  It
would help reviewers if you said "this corresponds to [PATCH 3/5] in the
original series, with the following improvements based on X and Y's
comments" after the three-dash line.


* Re: [PATCH 0/5] Suggested for PU: revision caching system to  significantly speed up packing/walking
  2009-08-08  3:11                     ` Junio C Hamano
@ 2009-08-08  7:27                       ` Nick Edelen
  2009-08-08  7:30                         ` Jeff King
  0 siblings, 1 reply; 34+ messages in thread
From: Nick Edelen @ 2009-08-08  7:27 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Nicolas Pitre, Johannes Schindelin, Sam Vilain, Michael J Gruber,
	Jeff King, Shawn O. Pearce, Andreas Ericsson, Christian Couder,
	git@vger.kernel.org

> IIRC from previous discussions, kernel.org's main performance problem is
> I/O, not CPU. Are there any provisions for sharing rev-caches between
> similar repositories, as we already do for objects?

I haven't implemented a transmission protocol or anything, but it
would be perfectly possible to copy cache slices from one repo to
another.  Generating the revision cache from scratch on large repos
can take several minutes, so this wouldn't be a bad idea.

> That depends primarily on how heavily the patches needed to change in
> response to review comments, but until the series lands in 'next', you
> would typically send updated series as a replacement, not incremental.
>
> Many people seemed to be interested in the series and had a volume of
> comments on it.  I suspect the updated series would be quite different
> from the original, so for the next round I would suspect it would be best
> to start anew, marking them as [PATCH N/M (v2)], in a fresh thread.  It
> would help reviewers if you said "this corresponds to [PATCH 3/5] in the
> original series, with the following improvements based on X and Y's
> comments" after the three-dash line.

Ok, that sounds good.  I've added a new patch as well so the numbering changes.


* Re: [PATCH 0/5] Suggested for PU: revision caching system to significantly speed up packing/walking
  2009-08-08  7:27                       ` Nick Edelen
@ 2009-08-08  7:30                         ` Jeff King
  2009-08-08  7:40                           ` Nick Edelen
  0 siblings, 1 reply; 34+ messages in thread
From: Jeff King @ 2009-08-08  7:30 UTC (permalink / raw)
  To: Nick Edelen
  Cc: Junio C Hamano, Nicolas Pitre, Johannes Schindelin, Sam Vilain,
	Michael J Gruber, Shawn O. Pearce, Andreas Ericsson,
	Christian Couder, git@vger.kernel.org

On Sat, Aug 08, 2009 at 09:27:55AM +0200, Nick Edelen wrote:

> > IIRC from previous discussions, kernel.org's main performance problem is
> > I/O, not CPU. Are there any provisions for sharing rev-caches between
> > similar repositories, as we already do for objects?
> 
> I haven't implemented a transmission protocol or anything, but it
> would be perfectly possible to copy cache slices from one repo to
> another.  Generating the revision cache from scratch on large repos
> can take several minutes, so this wouldn't be a bad idea.

That might be useful, but I was thinking more of an "alternates"-like
mechanism between repos. So that the data is stored only once on disk
and in the disk cache, which is helpful for sites like kernel.org which
serve many similar repositories.

-Peff


* Re: [PATCH 0/5] Suggested for PU: revision caching system to  significantly speed up packing/walking
  2009-08-08  7:30                         ` Jeff King
@ 2009-08-08  7:40                           ` Nick Edelen
  0 siblings, 0 replies; 34+ messages in thread
From: Nick Edelen @ 2009-08-08  7:40 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, Nicolas Pitre, Johannes Schindelin, Sam Vilain,
	Michael J Gruber, Shawn O. Pearce, Andreas Ericsson,
	Christian Couder, git@vger.kernel.org

> That might be useful, but I was thinking more of an "alternates"-like
> mechanism between repos. So that the data is stored only once on disk
> and in the disk cache, which is helpful for sites like kernel.org which
> serve many similar repositories.

Oh, right.  Yes, that seems like it could work.  We'd have to be
careful that a shared cache slice wouldn't change (like in a fuse or
something), but other than that we could have something as simple as a
link.
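A sketch of what "as simple as a link" might look like, assuming a POSIX filesystem. The directory layout and file names here are invented for the demo and do not reflect where rev-cache actually stores slices; marking the shared file read-only is only one crude way to discourage changes to it.

```python
import os
import tempfile

# Scratch layout standing in for two related repositories (hypothetical).
root = tempfile.mkdtemp()
shared_dir = os.path.join(root, "shared")
borrower_dir = os.path.join(root, "borrower")
os.makedirs(shared_dir)
os.makedirs(borrower_dir)

# The repository that owns the slice writes it once.
shared_slice = os.path.join(shared_dir, "slice")
with open(shared_slice, "wb") as f:
    f.write(b"slice-bytes")
os.chmod(shared_slice, 0o444)  # read-only, so a careless writer fails loudly

# The borrowing repository only holds a symbolic link to it.
borrowed_slice = os.path.join(borrower_dir, "slice")
os.symlink(shared_slice, borrowed_slice)

# Both names resolve to one file: one copy on disk and in the disk cache.
same_file = os.path.samefile(shared_slice, borrowed_slice)
with open(borrowed_slice, "rb") as f:
    borrowed_bytes = f.read()
```

Anything that rewrites slices in the owning repository (a fuse, for instance) would still have to leave shared slices alone, which is the caveat raised above.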


* Re: [PATCH 0/5] Suggested for PU: revision caching system to significantly speed up packing/walking
  2009-08-07 22:48                 ` Junio C Hamano
  2009-08-07 22:53                   ` Nick Edelen
@ 2009-08-08  2:50                   ` Jeff King
  1 sibling, 0 replies; 34+ messages in thread
From: Jeff King @ 2009-08-08  2:50 UTC (permalink / raw)
  To: Nick Edelen
  Cc: Junio C Hamano, Nicolas Pitre, Johannes Schindelin, Sam Vilain,
	Michael J Gruber, Shawn O. Pearce, Andreas Ericsson,
	Christian Couder, git@vger.kernel.org

On Fri, Aug 07, 2009 at 03:48:51PM -0700, Junio C Hamano wrote:

> > The cache file for all of the linux repository (as of a few weeks ago)
> > is around 42MB, without names.  The names would probably add 2 or 3 MB
> > on top of that.  That's probably about as big as I'd want to get,
> 
> Hmm.  .git/objects/ as of today is about 482M here, so we are talking
> about roughly 10% overhead?

IIRC from previous discussions, kernel.org's main performance problem is
I/O, not CPU. Are there any provisions for sharing rev-caches between
similar repositories, as we already do for objects?

-Peff


* Re: [PATCH 0/5] Suggested for PU: revision caching system to  significantly speed up packing/walking
  2009-08-06 19:06       ` Johannes Schindelin
  2009-08-06 20:01         ` Nick Edelen
  2009-08-07  2:47         ` Sam Vilain
@ 2009-08-08 18:57         ` Junio C Hamano
  2 siblings, 0 replies; 34+ messages in thread
From: Junio C Hamano @ 2009-08-08 18:57 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Nick Edelen, Michael J Gruber, Junio C Hamano, Jeff King,
	Sam Vilain, Shawn O. Pearce, Andreas Ericsson, Christian Couder,
	git@vger.kernel.org

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> My idea with that was that you already have a SHA-1 map in the pack index, 
> and if all you want to be able to accelerate the revision walker, you'd 
> probably need something that adds yet another mapping, from commit to 
> parents and tree, and from tree to sub-tree and blob (so you can avoid 
> unpacking commit and tree objects).
>
> I just thought that it could be more efficient to do it at the time the 
> pack index is written _anyway_, as nothing will change in the pack after 
> that anyway.
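The "yet another mapping" in the quoted text can be pictured roughly as below. This is only an illustration of the idea, with made-up object names and record types (`CommitEntry`, `TreeEntry`); it says nothing about how such a mapping would be laid out in a pack idx or a cache slice.

```python
from dataclasses import dataclass, field


@dataclass
class CommitEntry:
    tree: str
    parents: list


@dataclass
class TreeEntry:
    subtrees: list = field(default_factory=list)
    blobs: list = field(default_factory=list)


# Hypothetical two-commit history with one shared blob and one subtree.
commit_map = {
    "c2": CommitEntry(tree="t2", parents=["c1"]),
    "c1": CommitEntry(tree="t1", parents=[]),
}
tree_map = {
    "t2": TreeEntry(subtrees=["t_sub"], blobs=["b1"]),
    "t1": TreeEntry(blobs=["b1"]),
    "t_sub": TreeEntry(blobs=["b2"]),
}


def reachable_objects(tip):
    """Collect every object reachable from tip using only the two maps."""
    seen = set()
    stack = [tip]
    while stack:
        oid = stack.pop()
        if oid in seen:
            continue
        seen.add(oid)
        if oid in commit_map:
            entry = commit_map[oid]
            stack.append(entry.tree)
            stack.extend(entry.parents)
        elif oid in tree_map:
            entry = tree_map[oid]
            stack.extend(entry.subtrees)
            stack.extend(entry.blobs)
    return seen


objects = reachable_objects("c2")
```

The walk never opens a commit or tree object itself, which is the saving the quoted paragraph is after.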

After reading the version 2 of the "documentation" patch and commenting
heavily on it, I partly share the same feeling with you.  The codepath to
pack objects is _one of the places_ you can generate rev-cache and slice
information without redoing a lot of work that has already been done
anyway.

But

 - You can write that information separately out to a different file.
   Logically it does not have to be _in_ the same pack idx file; and

 - You may want to generate rev-cache information even if you do not pack
   the repository.  They may practically go hand-in-hand, but logically
   they are orthogonal.

And I am not sure if it is easy to retrofit "rev-list | pack-objects" code
to additionally produce this information, while keeping the standalone
version of rev-cache generation.

Having said all that.

I haven't read the side of the patch that _uses_ the information stored in
the rev-cache to figure out what it optimizes and what its limitations are
(e.g. how it interacts with pathspecs).  Perhaps the rev-cache may turn
out to be _only_ useful for pack-objects and nothing else, in which case
we may not care about standalone version of rev-cache generator after all.

If that is the case, I think it is also a reasonable implementation if the
rev-cache is generated only by "rev-list | pack-objects" codepath as a
side effect of traversal it already does, and it might even make sense to
introduce the version 3 of pack idx format that let you record additional
information, like you suggest.  I am not ready to make that judgement as I
haven't read the rest, but my gut feeling tells me that you might be
right.


Thread overview: 34+ messages
2009-08-06  9:55 [PATCH 0/5] Suggested for PU: revision caching system to significantly speed up packing/walking Nick Edelen
2009-08-06 14:48 ` Johannes Schindelin
2009-08-06 14:58   ` Michael J Gruber
2009-08-06 17:39     ` Nick Edelen
2009-08-06 19:06       ` Johannes Schindelin
2009-08-06 20:01         ` Nick Edelen
2009-08-06 20:30           ` Nick Edelen
2009-08-06 20:32             ` Shawn O. Pearce
2009-08-06 23:35               ` A Large Angry SCM
2009-08-06 23:37                 ` Shawn O. Pearce
2009-08-06 23:43                   ` A Large Angry SCM
2009-08-07  0:15                     ` Nick Edelen
2009-08-07  6:05                   ` Johannes Schindelin
2009-08-07  4:42             ` Nicolas Pitre
2009-08-07  2:47         ` Sam Vilain
2009-08-07  4:35           ` Nicolas Pitre
2009-08-07  6:08             ` Johannes Schindelin
2009-08-07 14:18               ` Nicolas Pitre
2009-08-08 15:18                 ` Johannes Schindelin
2009-08-08 16:07                   ` Junio C Hamano
2009-08-08 23:54                   ` Sam Vilain
2009-08-09  2:37                   ` Nicolas Pitre
2009-08-09 13:42                     ` Nick Edelen
2009-08-07  6:12           ` Johannes Schindelin
2009-08-07 15:00             ` Nicolas Pitre
2009-08-07 22:02               ` Nick Edelen
2009-08-07 22:48                 ` Junio C Hamano
2009-08-07 22:53                   ` Nick Edelen
2009-08-08  3:11                     ` Junio C Hamano
2009-08-08  7:27                       ` Nick Edelen
2009-08-08  7:30                         ` Jeff King
2009-08-08  7:40                           ` Nick Edelen
2009-08-08  2:50                   ` Jeff King
2009-08-08 18:57         ` Junio C Hamano
