Ideas to speed up repacking

git.vger.kernel.org archive mirror

* Ideas to speed up repacking
@ 2013-12-02 23:30 Martin Fick
  2013-12-03  0:44 ` Junio C Hamano
  0 siblings, 1 reply; 7+ messages in thread
From: Martin Fick @ 2013-12-02 23:30 UTC (permalink / raw)
  To: git

I wanted to explore the idea of exploiting knowledge about 
previous repacks to help speed up future repacks.  

I had various ideas that seemed like they might be good 
places to start, but things quickly got away from me.  
Mainly I wanted to focus on reducing and even sometimes 
eliminating reachability calculations since that seems to be 
the one major unsolved slow piece during repacking.

My first line of thinking goes like this:  "After a full 
repack, reachability of the current refs is known.  Exploit 
that knowledge for future repacks."  There are some very 
simple scenarios where if we could figure out how to 
identify them reliably, I think we could simply avoid 
reachability calculations entirely, and yet end up with the 
same repacked files as if we had done the reachability 
calculations.  Let me outline some to see if they make sense 
as a starting place for further discussion.

-------------

* Setup 1:  

  Do a full repack.  All loose and packed objects are added 
to a single pack file (assumes git config repack options do 
not create multiple packs).

* Scenario 1:

  Start with Setup 1.  Nothing has changed on the repo 
contents (no new object/packs, refs all the same), but 
repacking config options have changed (for example 
compression level has changed).

* Scenario 2:

   Starts with Setup 1.  Add one new pack file that was 
pushed to the repo by adding a new ref to the repo (existing 
refs did not change).

* Scenario 3: 

   Starts with Setup 1.  Add one new pack file that was 
pushed to the repo by updating an existing ref with a fast 
forward.

* Scenario 4:

   Starts with Setup 1.  Add some loose objects to the repo 
via a local fast forward ref update (I am assuming this is 
possible without adding any new unreferenced objects?)

In all 4 scenarios, I believe we should be able to skip 
history traversal and simply grab all objects and repack 
them into a new file?

-------------

Of the 4 scenarios above, it seems like #3 and #4 are very 
common operations (#2 is perhaps even more common for 
Gerrit)?  If these scenarios can be reliably identified 
somehow, then perhaps they could be used to reduce repacking 
time for these scenarios, and later used as building blocks 
to reduce repacking time for other related but slightly more 
complicated scenarios (with reduced history walking instead 
of none)?

For example to identify scenario 1, what if we kept a copy 
of all refs and their shas used during a full repack along 
with the newly repacked file?  A simplistic approach would 
store them in the same format as the packed-refs file as 
pack-<sha>.refs.  During repacking, if none of the refs have 
changed and there are no new objects...  

Then, if none of the refs have changed and there are new 
objects, we can just throw the new objects away?
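
A minimal sketch of the refs-snapshot comparison described above, 
assuming the snapshot uses the packed-refs text format.  The 
function names (parse_refs_snapshot, diff_refs, refs_unchanged) 
are hypothetical and not part of git.  It only answers the "have 
any refs changed?" half of the question; detecting new objects or 
packs would need a separate check.

```python
def parse_refs_snapshot(text):
    """Parse packed-refs style text into a {refname: sha} dict."""
    refs = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, the "# pack-refs with:" header, and the
        # "^<sha>" lines that record peeled tags.
        if not line or line.startswith("#") or line.startswith("^"):
            continue
        sha, name = line.split(" ", 1)
        refs[name] = sha
    return refs


def diff_refs(snapshot, current):
    """Return (added, removed, changed) ref names between two ref maps."""
    added = sorted(set(current) - set(snapshot))
    removed = sorted(set(snapshot) - set(current))
    changed = sorted(
        name for name in snapshot.keys() & current.keys()
        if snapshot[name] != current[name]
    )
    return added, removed, changed


def refs_unchanged(snapshot, current):
    """True when no ref was added, removed, or moved since the snapshot."""
    return diff_refs(snapshot, current) == ([], [], [])
```

With such a snapshot, Scenario 1 would show no differences, 
Scenario 2 would show exactly one added ref, and Scenarios 3 and 4 
would show exactly one changed ref.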

...

I am going to stop here because this email is long enough 
and I wanted to get some feedback on the ideas first before 
offering more solutions.

Thanks,

-Martin

-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation


* Re: Ideas to speed up repacking
  2013-12-02 23:30 Ideas to speed up repacking Martin Fick
@ 2013-12-03  0:44 ` Junio C Hamano
  2013-12-03  3:27   ` Duy Nguyen
  0 siblings, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2013-12-03  0:44 UTC (permalink / raw)
  To: Martin Fick; +Cc: git

Martin Fick <mfick@codeaurora.org> writes:

> I wanted to explore the idea of exploiting knowledge about 
> previous repacks to help speed up future repacks.  
>
> I had various ideas that seemed like they might be good 
> places to start, but things quickly got away from me.  
> Mainly I wanted to focus on reducing and even sometimes 
> eliminating reachability calculations since that seems to be 
> be the one major unsolved slow piece during repacking.
>
> My first line of thinking goes like this:  "After a full 
> repack, reachability of the current refs is known.  Exploit 
> that knowledge for future repacks."  There are some very 
> simple scenarios where if we could figure out how to 
> identify them reliably, I think we could simply avoid 
> reachability calculations entirely, and yet end up with the 
> same repacked files as if we had done the reachability 
> calculations.  Let me outline some to see if they make sense 
> as starting place for further discussion.
>
> -------------
>
> * Setup 1:  
>
>   Do a full repack.  All loose and packed objects are added 
> to a single pack file (assumes git config repack options do 
> not create multiple packs).
>
> * Scenario 1:
>
>   Start with Setup 1.  Nothing has changed on the repo 
> contents (no new object/packs, refs all the same), but 
> repacking config options have changed (for example 
> compression level has changed).
>
> * Scenario 2:
>
>    Starts with Setup 1.  Add one new pack file that was 
> pushed to the repo by adding a new ref to the repo (existing 
> refs did not change).
>
> * Scenario 3: 
>
>    Starts with Setup 1.  Add one new pack file that was 
> pushed to the repo by updating an existing ref with a fast 
> forward.
>
> * Scenario 4:
>
>    Starts with Setup 1.  Add some loose objects to the repo 
> via a local fast forward ref update (I am assuming this is 
> possible without adding any new unreferenced objects?)
>
>
> In all 4 scenarios, I believe we should be able to skip 
> history traversal and simply grab all objects and repack 
> them into a new file?

If nothing else has happened in the repository, perhaps, but I
suspect that the real problem is how you would prove it.  For
example, I am guessing that your Scenario 4 could be something like:

    : setup #1
    $ git repack -a -d -f
    $ git prune

    : scenario #4
    $ git commit --allow-empty -m 'new commit'

which would add a single loose object to the repository, advancing
the current branch ref by one commit, fast-forwarding relative to
the state you were in after setup #1.

But how would you efficiently prove that it was the only thing that
happened?  The user could have done this instead of a single commit:

    : scenario #4 look-alike
    $ git commit --allow-empty -m 'lost commit'
    $ git reset --hard HEAD^
    $ git commit --allow-empty -m 'new commit'

and the reflog entry for HEAD or the current branch ref for that
lost commit may be already ancient when you looked at this state.
Your object database has two loose commits, and you would want to
lose the older one 'lost commit' which is not reachable.
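
To make the difference concrete, here is a toy in-memory model (not
git code; the commit labels stand in for object names, and every
identifier is made up for illustration).  Both histories leave the
ref pointing at the same kind of endpoint, yet only a reachability
walk reveals the stray loose commit:

```python
def reachable(tips, parents):
    """Walk parent links from the ref tips; return every commit reached."""
    seen = set()
    stack = list(tips)
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, ()))
    return seen


def unreachable_loose(refs, loose, parents):
    """Loose commits that no ref can reach, i.e. candidates for pruning."""
    return loose - reachable(refs.values(), parents)


# Parent links for every commit object that exists on disk.
parents = {"base": [], "new": ["base"], "lost": ["base"]}

# Scenario #4: one new commit on top of the packed history.
simple_refs = {"refs/heads/master": "new"}
simple_loose = {"new"}

# Look-alike: 'lost commit', reset --hard HEAD^, then 'new commit'.
lookalike_refs = {"refs/heads/master": "new"}
lookalike_loose = {"lost", "new"}
```

Comparing simple_refs with lookalike_refs shows no difference, while
unreachable_loose() returns an empty set for the first state and
{"lost"} for the second.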

Also with Scenario #2, how would you prove that the new pack does
not contain any cruft that is not reachable?  When receiving a pack
and updating our refs, we only prove that we have all the objects
needed to complete updated refs---we do not reject packs with cruft
that is not necessary.

These two are only examples, and we might be able to convince
ourselves that not pruning (or ejecting cruft from packs) is OK, but
that is introducing a different mode of operation, not optimizing
the repacking without changing what "repacking" means (I am not
saying it is bad to change the meaning if we can make a good
argument weighing the pros and cons; a small bloat might be
acceptable in exchange for a good enough performance gain, but not
if the user is using repack && prune as a way to eradicate
undesirable contents from the object database).


* Re: Ideas to speed up repacking
  2013-12-03  0:44 ` Junio C Hamano
@ 2013-12-03  3:27   ` Duy Nguyen
  2013-12-03  7:17     ` Junio C Hamano
  0 siblings, 1 reply; 7+ messages in thread
From: Duy Nguyen @ 2013-12-03  3:27 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Martin Fick, Git Mailing List

On Tue, Dec 3, 2013 at 7:44 AM, Junio C Hamano <gitster@pobox.com> wrote:
>> * Scenario 4:
>>
>>    Starts with Setup 1.  Add some loose objects to the repo
>> via a local fast forward ref update (I am assuming this is
>> possible without adding any new unreferenced objects?)
>>
>>
>> In all 4 scenarios, I believe we should be able to skip
>> history traversal and simply grab all objects and repack
>> them into a new file?
>
> If nothing else has happened in the repository, perhaps, but I
> suspect that the real problem is how you would prove it.  For
> example, I am guessing that your Scenario 4 could be something like:
>
>     : setup #1
>     $ git repack -a -d -f
>     $ git prune
>
>     : scenario #4
>     $ git commit --allow-empty -m 'new commit'
>
> which would add a single loose object to the repository, advancing
> the current branch ref by one commit, fast-forwarding relative to
> the state you were in after setup #1.
>
> But how would you efficiently prove that it was the only thing that
> happened?

Shawn mentioned elsewhere that we could generate a bundle header and
keep it in a pack-XXX.bh file at pack creation time. With that
information we could verify if a ref has been reset, just fast
forwarded or even deleted.
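
A rough sketch of that check, assuming the recorded header gives us
the tip each ref had at pack creation time.  The commit graph is
modelled as a plain dict of parent links, and classify() and
is_ancestor() are hypothetical helpers, not git functions.  It
classifies ref movement only; it says nothing about whether
unreachable objects were left behind.

```python
def is_ancestor(old, new, parents):
    """True if `old` can be reached from `new` by following parent links."""
    stack, seen = [new], set()
    while stack:
        commit = stack.pop()
        if commit == old:
            return True
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, ()))
    return False


def classify(recorded, current, parents):
    """Describe how a ref moved between pack creation time and now."""
    if current is None:
        return "deleted"
    if recorded is None:
        return "created"
    if recorded == current:
        return "unchanged"
    if is_ancestor(recorded, current, parents):
        return "fast-forward"
    # The old tip is no longer part of the ref's history.
    return "reset"
```

For example, with parents = {"a": [], "b": ["a"], "c": ["a"]},
moving a ref from "a" to "b" is a fast-forward, while moving it from
"b" to "c" is a reset.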

> Also with Scenario #2, how would you prove that the new pack does
> not contain any cruft that is not reachable?  When receiving a pack
> and updating our refs, we only prove that we have all the objects
> needed to complete updated refs---we do not reject packs with crufts
> that are not necessary.

We trust the pack producer to do it correctly, I guess. If a pack
producer guarantees not to store any cruft, it could mark the pack
somehow.
-- 
Duy


* Re: Ideas to speed up repacking
  2013-12-03  3:27   ` Duy Nguyen
@ 2013-12-03  7:17     ` Junio C Hamano
  2013-12-03 10:17       ` Duy Nguyen
  0 siblings, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2013-12-03  7:17 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Martin Fick, Git Mailing List

Duy Nguyen <pclouds@gmail.com> writes:

>> If nothing else has happened in the repository, perhaps, but I
>> suspect that the real problem is how you would prove it.  For
>> example, I am guessing that your Scenario 4 could be something like:
>>
>>     : setup #1
>>     $ git repack -a -d -f
>>     $ git prune
>>
>>     : scenario #4
>>     $ git commit --allow-empty -m 'new commit'
>>
>> which would add a single loose object to the repository, advancing
>> the current branch ref by one commit, fast-forwarding relative to
>> the state you were in after setup #1.
>>
>> But how would you efficiently prove that it was the only thing that
>> happened?
>
> Shawn mentioned elsewhere that we could generate bundle header in and
> keep it in pack-XXX.bh file at pack creation time. With that
> information we could verify if a ref has been reset, just fast
> forwarded or even deleted.

With what information? If you keep the back-then-current information
and nothing else, how would you differentiate between the simple
scenario #4 above vs 'lost and new' two commit versions of the
scenario?  The endpoints should both show that one ref (and only one
ref) advanced by one commit, but one has cruft in the object
database while the other does not.

>> Also with Scenario #2, how would you prove that the new pack does
>> not contain any cruft that is not reachable?  When receiving a pack
>> and updating our refs, we only prove that we have all the objects
>> needed to complete updated refs---we do not reject packs with crufts
>> that are not necessary.
>
> We trust the pack producer to do it correctly, I guess. If a pack
> producer guarantees not to store any cruft, it could mark the pack
> somehow.

That is not an answer.  Since when do we design to blindly trust
anybody on the other end of the wire?


* Re: Ideas to speed up repacking
  2013-12-03  7:17     ` Junio C Hamano
@ 2013-12-03 10:17       ` Duy Nguyen
  2013-12-03 17:50         ` Junio C Hamano
  0 siblings, 1 reply; 7+ messages in thread
From: Duy Nguyen @ 2013-12-03 10:17 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Martin Fick, Git Mailing List

On Tue, Dec 3, 2013 at 2:17 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Duy Nguyen <pclouds@gmail.com> writes:
>
>>> If nothing else has happened in the repository, perhaps, but I
>>> suspect that the real problem is how you would prove it.  For
>>> example, I am guessing that your Scenario 4 could be something like:
>>>
>>>     : setup #1
>>>     $ git repack -a -d -f
>>>     $ git prune
>>>
>>>     : scenario #4
>>>     $ git commit --allow-empty -m 'new commit'
>>>
>>> which would add a single loose object to the repository, advancing
>>> the current branch ref by one commit, fast-forwarding relative to
>>> the state you were in after setup #1.
>>>
>>> But how would you efficiently prove that it was the only thing that
>>> happened?
>>
>> Shawn mentioned elsewhere that we could generate bundle header in and
>> keep it in pack-XXX.bh file at pack creation time. With that
>> information we could verify if a ref has been reset, just fast
>> forwarded or even deleted.
>
> With what information? If you keep the back-then-current information
> and nothing else, how would you differentiate between the simple
> scenario #4 above vs 'lost and new' two commit versions of the
> scenario?  The endpoints should both show that one ref (and only one
> ref) advanced by one commit, but one has cruft in the object
> database while the other does not.

Yeah I was wrong. Reading Martin's mail again I wonder how we can just
"grab all objects and skip history traversal". Who will decide object
order in the new pack if we don't traverse history and collect path
information?
-- 
Duy


* Re: Ideas to speed up repacking
  2013-12-03 10:17       ` Duy Nguyen
@ 2013-12-03 17:50         ` Junio C Hamano
  2013-12-03 19:26           ` Martin Fick
  0 siblings, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2013-12-03 17:50 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Martin Fick, Git Mailing List

Duy Nguyen <pclouds@gmail.com> writes:

> Reading Martin's mail again I wonder how we just
> "grab all objects and skip history traversal". Who will decide object
> order in the new pack if we don't traverse history and collect path
> information.

I vaguely recall raising a related topic for "quick repack, assuming
everything in existing packfiles is reachable, that only removes
loose cruft" several weeks ago.  Once you decide that your quick
repack does not care about ejecting objects from existing packs, like
how I suspect Martin's outline will lead us to, we can repack the
reachable loose ones on the recent surface of the history and then
concatenate the contents of existing packs, excluding duplicates and
possibly adjusting the delta base offsets for some entries, without
traversing the bulk of the history.
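
The concatenation step could look roughly like the toy model below.
It is only a sketch under simplifying assumptions: each pack is a
list of Entry records (a made-up type), delta bases are stored as
absolute offsets within the same pack, and entry sizes stay fixed.
A real packfile encodes an offset delta as a relative distance in a
variable-length header, so sizes could shift when the base moves.
Repacking the loose objects is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    oid: str                           # object name
    offset: int                        # position of the entry in its pack
    size: int                          # bytes the entry occupies
    base_offset: Optional[int] = None  # delta base position, if a delta


def concatenate(packs):
    """Append packs in order, dropping duplicates and remapping bases."""
    out = []
    new_offset = {}  # oid -> offset in the combined pack
    cursor = 0
    for pack in packs:
        oid_at = {e.offset: e.oid for e in pack}
        for e in sorted(pack, key=lambda entry: entry.offset):
            if e.oid in new_offset:
                continue  # an earlier pack already supplied this object
            base = None
            if e.base_offset is not None:
                # The base precedes the delta in its own pack, so it has
                # already been emitted (or matched to an earlier copy).
                base = new_offset[oid_at[e.base_offset]]
            out.append(Entry(e.oid, cursor, e.size, base))
            new_offset[e.oid] = cursor
            cursor += e.size
    return out
```

Keeping the first copy of each object preserves the existing order
within each pack, and no history is walked to build the result.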


* Re: Ideas to speed up repacking
  2013-12-03 17:50         ` Junio C Hamano
@ 2013-12-03 19:26           ` Martin Fick
  0 siblings, 0 replies; 7+ messages in thread
From: Martin Fick @ 2013-12-03 19:26 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Duy Nguyen, Git Mailing List

> Martin Fick <mfick@codeaurora.org> writes:
> > * Setup 1:
> >   Do a full repack.  All loose and packed objects are
> >   added
...
> > * Scenario 1:
> >   Start with Setup 1.  Nothing has changed on the repo
> > contents (no new object/packs, refs all the same), but
> > repacking config options have changed (for example
> > compression level has changed).


On Tuesday, December 03, 2013 10:50:07 am Junio C Hamano 
wrote:
> Duy Nguyen <pclouds@gmail.com> writes:
> > Reading Martin's mail again I wonder how we just
> > "grab all objects and skip history traversal". Who will
> > decide object order in the new pack if we don't
> > traverse history and collect path information.
> 
> I vaguely recall raising a related topic for "quick
> repack, assuming everything in existing packfiles are
> reachable, that only removes loose cruft" several weeks
> ago.  Once you decide that your quick repack do not care
> about ejecting objects from existing packs, like how I
> suspect Martin's outline will lead us to, we can repack
> the reachable loose ones on the recent surface of the
> history and then concatenate the contents of existing
> packs, excluding duplicates and possibly adjusting the
> delta base offsets for some entries, without traversing
> the bulk of the history.

From this, it sounds like scenario 1 (a single pack being 
repacked) might then be doable (just trying to establish a 
really simple baseline)?  Except that it would potentially 
not result in the same ordering without traversing history?  
Or, would the current pack ordering be preserved and thus be 
correct?

-Martin


