Re: [PATCH 1/3] Lazily open pack index files on demand

From: "Dana How" <danahow@gmail.com>
To: "Junio C Hamano" <junkio@cox.net>
Cc: "Shawn O. Pearce" <spearce@spearce.org>,
	git@vger.kernel.org, danahow@gmail.com
Subject: Re: [PATCH 1/3] Lazily open pack index files on demand
Date: Sat, 26 May 2007 10:31:00 -0700
Message-ID: <56b7f5510705261031o311b89bapd730374cbc063931@mail.gmail.com>
In-Reply-To: <7vabvsm1h8.fsf@assigned-by-dhcp.cox.net>

On 5/26/07, Junio C Hamano <junkio@cox.net> wrote:
> "Shawn O. Pearce" <spearce@spearce.org> writes:
> >  This conflicts (in a subtle way) with Dana How's
> >  "sha1_file.c:rearrange_packed_git() should consider packs' object
> >  sizes" patch as we now have num_objects = 0 for any indexes we
> >  have not opened.  In the case of Dana's patch this would cause
> >  those packfiles to have very high ranks, possibly sorting much
> >  later than they should have.
> I am keeping that rearrange stuff on hold, partly because I am
> moderately hesitant to do the fp, which feels overkill at that
> low level of code.
Oh, I thought the fp might cause a gag reflex -- I had to add -lm.
Unfortunately, when trying to automatically detect and grade outliers,
which is what I was doing, (datum - mean) / std_dev is hard to beat,
and I needed sqrt for std_dev -- all other fp could be easily written out.
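
A minimal sketch of that kind of outlier test, not the code from the
patch. The function name, the double-based inputs and the threshold k
are assumptions made for illustration. It only answers "is this datum
more than k standard deviations from the mean" by comparing squares,
which sidesteps sqrt; producing the grade itself, (datum - mean) / std_dev,
still needs sqrt as described above.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return 1 if `datum` lies more than `k` standard deviations away from
 * the mean of values[0..n-1], else 0.  Uses the population variance.
 * Comparing squares avoids sqrt():
 *   |datum - mean| / std_dev > k  <=>  (datum - mean)^2 > k^2 * variance
 * `is_outlier` and its parameters are hypothetical names.
 */
int is_outlier(const double *values, size_t n, double datum, double k)
{
	double sum = 0.0, sq = 0.0, mean, var, d;
	size_t i;

	assert(k >= 0.0);
	if (n == 0)
		return 0;

	for (i = 0; i < n; i++)
		sum += values[i];
	mean = sum / (double)n;

	for (i = 0; i < n; i++) {
		d = values[i] - mean;
		sq += d * d;
	}
	var = sq / (double)n;

	d = datum - mean;
	return d * d > k * k * var;
}
```

With per-pack averages of 10, 12, 11, 9 and 13, a datum of 500 is
flagged at k = 2 while a datum of 12 is not.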

> Also, I am hoping that we can discard that the object density
> criteria altogether by making the default repack behaviour
> friendlier to the pathological cases, e.g. by emitting huge
> blobs at the end of the packstream, potentially pushing it out
> to later parts of split packs by themselves and automatically
> marking them with the .keep flag.  Until that kind of
> improvements materialize, people with pathological cases could
> (1) handcraft a pack that contains only megablob, (2) place that
> on central alternate, (3) touch it with artificially old
> timestamp, which hopefully is a good enough workaround.
I think we should do what we can to make the timestamp as
meaningful as possible, which is why I submitted that stamping patch.

I think there are two interesting strategies compatible
with maximally-informative timestamps:

(1) git-repack -a -d repacks everything on each call.  You would need:
(1a) Rewrite builtin-pack-objects.c so only the object_ix hash
     accesses the "objects" array directly; everything else
     goes through a pointer table.
(1b) Sort the new pointer table by object type, in the order
     tag -> commit -> tree -> nice blob -> naughty blob.
     The sort is stable, so the order within each group is unchanged.
(1c) Do not deltify naughty blobs.  Naughty blobs are those
     blobs marked "nodelta" or very large blobs.
(1d) Write out objects in the new pointer table order.  Splitting
     will cause metadata to be in the first pack, and naughty blobs
     will tend to be in the last pack.
(1e) When done writing all packs, swap their timestamps
     so the current timestamp sorting will look at naughty blobs last.
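
Steps (1b) and (1c) can be sketched as follows. This is not taken from
builtin-pack-objects.c: struct entry, BIG_BLOB_LIMIT, write_class() and
sort_for_writing() are hypothetical names, and the size limit is an
arbitrary placeholder for "very large". Since qsort() is not guaranteed
to be stable, the comparator breaks ties on the entries' addresses in
the underlying "objects" array, which preserves the original order
within each group.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

enum obj_type { TYPE_TAG, TYPE_COMMIT, TYPE_TREE, TYPE_BLOB };

/* Hypothetical stand-in for an entry of the "objects" array. */
struct entry {
	enum obj_type type;
	unsigned long size;
	int nodelta;		/* nonzero: marked "nodelta" */
};

/* Placeholder threshold for "very large"; the real limit is unspecified. */
#define BIG_BLOB_LIMIT (64UL * 1024 * 1024)

/* Rank in write order: tag, commit, tree, nice blob, naughty blob. */
static int write_class(const struct entry *e)
{
	switch (e->type) {
	case TYPE_TAG:    return 0;
	case TYPE_COMMIT: return 1;
	case TYPE_TREE:   return 2;
	default:
		return (e->nodelta || e->size > BIG_BLOB_LIMIT) ? 4 : 3;
	}
}

static int cmp_entry(const void *a, const void *b)
{
	const struct entry *x = *(const struct entry *const *)a;
	const struct entry *y = *(const struct entry *const *)b;
	int cx = write_class(x), cy = write_class(y);

	if (cx != cy)
		return cx < cy ? -1 : 1;
	/* Same class: keep original array order (both point into one array). */
	return (x > y) - (x < y);
}

/* Sort the pointer table; the "objects" array itself is left untouched. */
void sort_for_writing(struct entry **table, size_t n)
{
	assert(table != NULL || n == 0);
	qsort(table, n, sizeof(*table), cmp_entry);
}
```

Only the pointer table is permuted, so a hash such as object_ix that
indexes the "objects" array directly would keep working unchanged.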

(2) git-repack -a -d runs in two passes and maintains .keep files:
(2a) Add a new flag --types=[gctb]+ to pack-objects, to be supplied
     by git-repack.  This means only the listed types
     (taGs/Commits/Trees/Blobs) are passed; all others are dropped.
(2b) Put a new loop around the core of git-repack.  In the first iteration,
     pack with --types=b, then with --types=gct in the second.
     Thus metadata will have the more recent timestamp.
(2c) If packs are split, also swap timestamps as in (1e),
     within each iteration.
(2d) If an iteration produces split packs, mark all but the last
     in the sequence with a .keep file automatically.  The
     .keep files contain the string "repack".
(2e) Add a new option to repack: -A.  If specified, the first
     thing repack does is remove any .keep file containing "repack".
(2f) The existing response of repack to .keep files -- do not repack them --
     is retained to ensure that on each -a/not -A repack, we only
     repack the tail of each set of packs: metadata, data.
     The metadata set will probably only ever contain one pack
     and will always be repacked.
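
As an illustration of (2a), the argument of the proposed --types= flag
could be turned into a bitmask along these lines. The flag does not
exist in pack-objects; parse_types() and the TYPE_BIT_* constants are
hypothetical names used only for this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define TYPE_BIT_TAG	(1u << 0)	/* 'g' */
#define TYPE_BIT_COMMIT	(1u << 1)	/* 'c' */
#define TYPE_BIT_TREE	(1u << 2)	/* 't' */
#define TYPE_BIT_BLOB	(1u << 3)	/* 'b' */

/*
 * Parse the letters following the proposed "--types=" into a bitmask.
 * Returns 0 if the argument is empty or contains an unknown letter,
 * matching the [gctb]+ pattern (at least one letter is required).
 */
unsigned parse_types(const char *arg)
{
	unsigned mask = 0;

	assert(arg != NULL);
	for (; *arg; arg++) {
		switch (*arg) {
		case 'g': mask |= TYPE_BIT_TAG; break;
		case 'c': mask |= TYPE_BIT_COMMIT; break;
		case 't': mask |= TYPE_BIT_TREE; break;
		case 'b': mask |= TYPE_BIT_BLOB; break;
		default:  return 0;
		}
	}
	return mask;
}
```

The two iterations in (2b) would then correspond to parse_types("b")
for the data pass and parse_types("gct") for the metadata pass.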

I've (badly) implemented (1b) and confirmed it had no impact
on the linux-2.6 repo.  I've also implemented (2a), (2b), (2d), and (2f),
but not fully measured them.  I'd like to finish this work, but
"megapacks" are very time-consuming to manipulate, and
with the loose megablob approach they are not as useful for me.

Finally, some people might want more esoteric repacking
strategies than what I've listed above.  We could add a
--packed flag to pack-objects to help them.  This means that
  git pack-objects --packed --unpacked=<pack1> --unpacked=<pack2>
would only repack pack1 and pack2 and would not absorb
any loose blobs.  This would allow you to maintain any number
of packfile classes you want and maintain them yourself.
Each would be indicated by something different in a .keep file.
(To newly absorb loose blobs in a class, you would do
  cat object-list | git-pack-objects --incremental
from some object-list you built following your rules).
These strategies would be too special-purpose to be in git,
but adding --packed is a small and useful change.

Shawn:  When I first saw the index-loading code, my first
thought was that all the index tables should be
merged (easy since they are sorted) so callers only need to do one search.
With indices loaded lazily, either you can't merge, or you
merge sequentially, raising the merge cost from (total entries) to
almost (index files) * (total entries).  What do you think about
merging the SHA-1 tables, and how would/should it interact with
lazy index file loading?
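
A sketch of the merge step, assuming each table is an array of 20-byte
SHA-1s sorted in memcmp() order; struct merged_entry and merge_tables()
are hypothetical and not git's data structures. If k index files of m
entries each were folded one at a time into a growing table, the passes
would touch about 2m + 3m + ... + km entries, on the order of k times
the total, which is the cost concern raised above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define HASH_LEN 20

/* Hypothetical merged-table entry: a SHA-1 plus the pack it came from. */
struct merged_entry {
	unsigned char sha1[HASH_LEN];
	int pack;
};

/*
 * Merge two SHA-1-sorted tables into `out`, which must have room for
 * na + nb entries.  An object present in both tables is emitted once,
 * keeping the entry from `a`.  Returns the number of entries written.
 */
size_t merge_tables(const struct merged_entry *a, size_t na,
		    const struct merged_entry *b, size_t nb,
		    struct merged_entry *out)
{
	size_t i = 0, j = 0, k = 0;

	assert(out != NULL);
	while (i < na && j < nb) {
		int c = memcmp(a[i].sha1, b[j].sha1, HASH_LEN);

		if (c < 0) {
			out[k++] = a[i++];
		} else if (c > 0) {
			out[k++] = b[j++];
		} else {
			out[k++] = a[i++];	/* duplicate: keep a's copy */
			j++;
		}
	}
	while (i < na)
		out[k++] = a[i++];
	while (j < nb)
		out[k++] = b[j++];
	return k;
}
```

A single lookup in the merged table then replaces one binary search per
pack index, at the price of having every index loaded before merging.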

BTW, if it's not apparent, I think my object density patch
should be dropped.  It has served its purpose as a thought experiment.

Thanks,
-- 
Dana L. How  danahow@gmail.com  +1 650 804 5991 cell
