Re: [PATCH] pack-objects: reuse data from existing pack.

From: Junio C Hamano <junkio@cox.net>
To: Andreas Ericsson <ae@op5.se>
Cc: git@vger.kernel.org
Subject: Re: [PATCH] pack-objects: reuse data from existing pack.
Date: Thu, 16 Feb 2006 01:13:04 -0800
Message-ID: <7vr763bra7.fsf@assigned-by-dhcp.cox.net>
In-Reply-To: <43F438AA.1040508@op5.se> (Andreas Ericsson's message of "Thu, 16 Feb 2006 09:32:42 +0100")

Andreas Ericsson <ae@op5.se> writes:

> Whoa! Columbus and the egg. Strange no one saw it before. It's so
> obvious when you shove it under the nose like that. :)

I wish the pack format were not as dense as it is today.  It
is very expensive to obtain the uncompressed size of a deltified
object.  For this reason, a delta newly created by the
experimental algorithm (either from a non-delta in an existing
pack or from a loose object) is never made against an object
that is stored in deltified form in a pack.  It also incurs a
nontrivial cost to obtain the size of the in-pack representation
of an object (either deltified or not).  But the inefficiency in
the resulting pack due to these factors may not matter in practice.
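
The base-selection restriction above can be sketched roughly as
follows.  This is only an illustration in Python, not the actual
pack-objects code; the `Obj` record and the `can_be_new_delta_base`
helper are hypothetical names made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Obj:
    name: str
    in_pack: bool   # stored in an existing pack?
    is_delta: bool  # stored in deltified form in that pack?


def can_be_new_delta_base(obj: Obj) -> bool:
    """A newly created delta is never made against an object that is
    stored as a delta in a pack, since learning its uncompressed size
    is expensive.  Loose objects and packed non-deltas are fine."""
    return not (obj.in_pack and obj.is_delta)


objs = [
    Obj("loose-blob", in_pack=False, is_delta=False),
    Obj("packed-plain", in_pack=True, is_delta=False),
    Obj("packed-delta", in_pack=True, is_delta=True),
]

# Only the first two objects qualify as bases for a new delta.
eligible = [o.name for o in objs if can_be_new_delta_base(o)]
print(eligible)
```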

I just packed the v2.6.16-rc3 object list (184141 objects) using
both the current and the experimental code, just for fun.
Tonight's experimental one runs in just under 1 minute on my
Duron 750 (with slow disks, I should add).  This was done in a
repository that has about 1500 loose objects and a single mega
pack; the reuse rate of packed data by the experimental
algorithm is about 99%.  I am hoping the one from "master" will
come back before I finish writing this message ;-).

There are subtleties.

For example, in a typical project, files tend to grow rather
than shrink on average, and older ones tend to be in packs.  If
you do packing the traditional way, the largest one (which is
typically the latest) is kept as non-delta, and all the smaller
ones will be incremental delta from that, no matter how your
packs and loose objects are organized.
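
A toy model of that traditional ordering, assuming size is the only
criterion (the real packer considers more than this; the
`traditional_plan` function and the sizes below are hypothetical):

```python
def traditional_plan(sizes):
    """sizes maps revision -> blob size.

    Returns (plain_rev, deltas), where deltas is a list of
    (rev, base_rev) pairs.  The largest blob is kept as non-delta and
    each smaller blob is a delta against the next larger one."""
    order = sorted(sizes, key=lambda rev: sizes[rev], reverse=True)
    plain = order[0]
    deltas = [(order[i], order[i - 1]) for i in range(1, len(order))]
    return plain, deltas


# A file that grows a little in every revision.
plain, deltas = traditional_plan({1: 100, 2: 120, 3: 150})
print(plain, deltas)
```

With a file that only grows, the latest revision ends up as the
non-delta, so reading it needs no delta application at all.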

Usually, you have the latest objects as loose objects in your
repository to be packed (either you push from it, or somebody
else pulls from you).  In other words, as you develop after your
last repack, you would accumulate loose objects, and they are
the ones that typically matter the most.

Let's say you have been changing the same file in every commit
(1..N), then you fully packed and then created another commit
(revision N+1) that touches the file.  The experimental
reuse-packer would:

    - notice blobs from revisions 1..(N-1) are deltified,
      each relative to the revision one greater than itself;
      these would be reused.
    - notice the blob from revision N is in the pack but not
      deltified.
    - notice the blob from revision N+1 is loose.

Then it would emit the bigger of N and N+1 as non-delta, and the
other one as delta.  1..(N-1) are output as delta.  If it
happens to choose N as plain, it does not have to uncompress and
recompress, so the pack process would go very fast, but you would
always end up having to apply the delta from rev N to N+1 on
top of the non-delta N to get to the latest blob in rev N+1, and
you would typically want to access the rev N+1 blob more often.

In other words, the experimental reuse-packer would create a
suboptimal pack in such a case.  Not a big deal, though.
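
The scenario above can be modelled with a small sketch.  It is not
the patch itself; `reuse_plan` and the sizes are hypothetical, and
picking N when the two sizes are equal is an assumption of this
example only.

```python
def reuse_plan(n, sizes):
    """Revisions 1..n were fully packed; revision n+1 is loose.

    sizes maps revision -> blob size for revisions n and n+1.
    Returns (plain_rev, reused, new_delta, cost_latest)."""
    # Blobs 1..n-1 are already deltas against the next revision and
    # are copied from the existing pack as they are.
    reused = [(rev, rev + 1) for rev in range(1, n)]
    # Of n (packed, non-delta) and n+1 (loose), the bigger one is
    # emitted as non-delta.  Tie-breaking towards n is an assumption.
    plain = n if sizes[n] >= sizes[n + 1] else n + 1
    other = n + 1 if plain == n else n
    new_delta = (other, plain)
    # How many deltas must be applied to read the latest revision.
    cost_latest = 0 if plain == n + 1 else 1
    return plain, reused, new_delta, cost_latest


# The file shrank in the last commit, so N stays plain and the
# latest blob is only reachable through one delta.
print(reuse_plan(3, {3: 200, 4: 180}))
# The file grew, so N+1 becomes the plain object.
print(reuse_plan(3, {3: 200, 4: 220}))
```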

We may want an option to disable the optimization for
weekly/monthly repacking.  git-daemon (or whatever runs
pack-objects via upload-pack) should use the default with the
optimization, since this is so obviously faster.

> Now that pack-creation went from bizarrely expensive to insanely cheap
> (well, comparable to "tar czf" anyways), what's BCP for packing a
> public repository? Always one mega-pack and never worry, or should one
> still use incremental and sometimes overlapping pack-files?

I would say an optimum single mega-pack would work the best, but
"repack -a -d" to create the mega-pack _with_ the optimization
may have a performance impact on users of the resulting packs.

Oh, the traditional one finally came back after 11 minutes.
