From: Jeff King <peff@peff.net>
To: Junio C Hamano <gitster@pobox.com>
Cc: Derrick Stolee <stolee@gmail.com>,
git@vger.kernel.org, Jonathan Tan <jonathantanmy@google.com>
Subject: Re: [PATCH 0/7] speeding up cat-file by reordering object access
Date: Thu, 16 Aug 2018 15:45:38 -0400
Message-ID: <20180816194537.GA5813@sigill.intra.peff.net>
In-Reply-To: <xmqqmutmdu02.fsf@gitster-ct.c.googlers.com>
On Thu, Aug 16, 2018 at 11:52:13AM -0700, Junio C Hamano wrote:
> Jeff King <peff@peff.net> writes:
>
> > I think that makes sense. We already see duplicates from
> > for_each_packed_object() when they're in multiple packs, and callers
> > just need to be ready to deal with it (and depending on what you're
> > doing, you may actually _want_ the duplicates).
>
> You of course would also see dups between loose and packed until
> prune-packed is run.
Yes, although there are two separate calls to look at those two sources,
so it's a bit more obvious that the caller has to de-dup. I'm not
opposed to a flag to ask the function to de-dup for us, but it really
only makes sense if we do a combined for_each_object() call.
De-duping is potentially expensive, and there's little point in
de-duping the packs if we have to then de-dup the result with the loose
objects.
One benefit of having the iterator function de-dup is that it _could_ be
done more efficiently. Right now (even before my recent patches), the
cat-file de-dup happens by creating a secondary list, sorting it, and
then skipping the duplicates.
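
For reference, that amounts to something like this (a minimal
standalone sketch, not git's actual code; the real thing collects
"struct object_id" entries into an oid_array rather than strings, and
the sample IDs and helper names here are made up):

  /* Sketch of "collect, sort, skip adjacent duplicates". */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  static int cmp_str(const void *a, const void *b)
  {
        return strcmp(*(const char *const *)a, *(const char *const *)b);
  }

  int main(void)
  {
        /* pretend these came from iterating loose and packed objects */
        const char *oids[] = { "c0ffee", "abad1d", "c0ffee", "f00d00" };
        size_t n = sizeof(oids) / sizeof(*oids), i;

        qsort(oids, n, sizeof(*oids), cmp_str);
        for (i = 0; i < n; i++) {
                if (i && !strcmp(oids[i], oids[i - 1]))
                        continue; /* skip the duplicate copy */
                printf("%s\n", oids[i]);
        }
        return 0;
  }
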
But if a function knew _all_ of the sources we were going to look at
(any loose directories, including alternates, and all available packs),
and it could walk each individual source in sorted order (easy for packs
by .idx, not too bad for loose by sorting the readdir() results), then
we could do it as an online list-merge, with a single walk through the
results.
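
A hedged sketch of what that online merge might look like, with just
two pre-sorted sources and string IDs for brevity (the real thing
would merge N sources and compare binary object IDs; the function and
sample names here are invented):

  #include <stdio.h>
  #include <string.h>

  /* Merge two sorted lists, emitting each ID exactly once. */
  static void merge_unique(const char **a, size_t na,
                           const char **b, size_t nb)
  {
        const char *last = NULL;
        size_t i = 0, j = 0;

        while (i < na || j < nb) {
                const char *next;

                if (j >= nb || (i < na && strcmp(a[i], b[j]) <= 0))
                        next = a[i++];
                else
                        next = b[j++];

                if (last && !strcmp(next, last))
                        continue; /* duplicate across (or within) sources */
                printf("%s\n", next);
                last = next;
        }
  }

  int main(void)
  {
        const char *pack[] = { "abad1d", "c0ffee" };  /* e.g. from a .idx */
        const char *loose[] = { "c0ffee", "f00d00" }; /* e.g. from readdir() */

        merge_unique(pack, 2, loose, 2);
        return 0;
  }
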
In practice I don't know if it's worth the effort. If you're accessing
the object contents, sorted (hash) order really is bad for locality. And
if you're just
collecting the hashes, then you're likely generating some data structure
for future lookups, and you can just de-dup as part of that process.
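
In code, that "de-dup as part of building the lookup structure" shape
is just a check-and-insert; git's oidset works that way (insert tells
you whether the entry was already there). Here's a standalone toy
version using a sorted array of strings instead of a hash set (names
and sample data are made up, and there's no bounds checking):

  #include <stdio.h>
  #include <string.h>

  #define MAX_SEEN 1024
  static const char *seen[MAX_SEEN];
  static size_t nr_seen;

  /* Returns 1 if oid was already present, 0 if it was just added. */
  static int check_and_insert(const char *oid)
  {
        size_t lo = 0, hi = nr_seen;

        while (lo < hi) {
                size_t mid = lo + (hi - lo) / 2;
                int cmp = strcmp(oid, seen[mid]);

                if (!cmp)
                        return 1;
                if (cmp < 0)
                        hi = mid;
                else
                        lo = mid + 1;
        }
        memmove(seen + lo + 1, seen + lo, (nr_seen - lo) * sizeof(*seen));
        seen[lo] = oid;
        nr_seen++;
        return 0;
  }

  int main(void)
  {
        const char *input[] = { "c0ffee", "abad1d", "c0ffee", "f00d00" };
        size_t i;

        for (i = 0; i < 4; i++)
                if (!check_and_insert(input[i]))
                        printf("first sighting of %s\n", input[i]);
        return 0;
  }
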
> I also was thinking about the same thing after Derrick's response,
> but unless you are a very specialized caller, it does not help you
> very much to learn that object X exists as a loose object locally,
> also as a loose object in our alternate, and also in pack A, but not
> in other packs. You need a way to say "Read the contents of object
> X from that place, not from any other place", "Remove that copy of
> object X at that place, but not at any other place" etc. to make
> effective use of that kind of information.
We do have those functions, and use them. E.g., fsck uses
read_loose_object() to actually check that particular copy of the
object. That's obviously a specialized caller as you note, but then
anybody iterating over all of the objects is pretty specialized already.
> The codepath that implements runtime access has "I found a copy
> here, but it is unusable, so let's go on to look for another usable
> copy" fallback. This is a tangent but it is something we should not
> lose in the midx-enabled world.
Yeah, good catch. That's orthogonal to most of this discussion, I think,
but it's a possible downside of the midx because it has literally
discarded those duplicate index entries (or at least that's my
understanding). If we kept the .idx for each pack as a fallback for
older Gits, then it's easy to solve: in the unlikely case the
.midx-referenced copy fails, you look in each individual pack for
another copy.
But I think eventually you'd want to stop generating those .idx files,
too, if you know that you'll only use a modern version of Git. I wonder
if the .midx format should have an extension for "here are all the
duplicates I found". They could even be part of the main index (it's
easy enough for a binary-search lookup to always point to the start of a
run of duplicates), but if you had a ton of duplicates they would slow
your binary searches a little.
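
For what it's worth, pointing at the start of a run is just a
lower-bound binary search instead of an "any match" one; something
like this standalone sketch (string IDs again; the real lookup would
work on binary object IDs inside the .midx):

  #include <stdio.h>
  #include <string.h>

  /* Return the first index whose entry is >= want; with duplicates,
   * that is the start of the run, so the caller can walk all copies. */
  static size_t first_match(const char **oids, size_t n, const char *want)
  {
        size_t lo = 0, hi = n;

        while (lo < hi) {
                size_t mid = lo + (hi - lo) / 2;

                if (strcmp(oids[mid], want) < 0)
                        lo = mid + 1;
                else
                        hi = mid;
        }
        return lo;
  }

  int main(void)
  {
        const char *idx[] = { "abad1d", "c0ffee", "c0ffee", "f00d00" };
        size_t i = first_match(idx, 4, "c0ffee");

        while (i < 4 && !strcmp(idx[i], "c0ffee"))
                printf("copy at index %zu\n", i++);
        return 0;
  }
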
I'll leave all that to Stolee to ponder. :)
-Peff
Thread overview: 27+ messages
2018-08-10 23:07 [PATCH 0/7] speeding up cat-file by reordering object access Jeff King
2018-08-10 23:09 ` [PATCH 1/7] for_each_*_object: store flag definitions in a single location Jeff King
2018-08-10 23:27 ` Stefan Beller
2018-08-10 23:31 ` Jeff King
2018-08-10 23:33 ` Jeff King
2018-08-10 23:39 ` Stefan Beller
2018-08-11 0:33 ` Jeff King
2018-08-10 23:09 ` [PATCH 2/7] for_each_*_object: take flag arguments as enum Jeff King
2018-08-10 23:11 ` [PATCH 3/7] for_each_*_object: give more comprehensive docstrings Jeff King
2018-08-10 23:15 ` [PATCH 4/7] for_each_packed_object: support iterating in pack-order Jeff King
2018-08-15 13:28 ` Derrick Stolee
2018-08-16 17:36 ` Jeff King
2018-08-10 23:16 ` [PATCH 5/7] t1006: test cat-file --batch-all-objects with duplicates Jeff King
2018-08-10 23:17 ` [PATCH 6/7] cat-file: rename batch_{loose,packed}_object callbacks Jeff King
2018-08-10 23:24 ` [PATCH 7/7] cat-file: support "unordered" output for --batch-all-objects Jeff King
2018-08-13 18:45 ` [PATCH 0/7] speeding up cat-file by reordering object access Jonathan Tan
2018-08-14 18:13 ` [PATCH 0/4] finishing touches on jk/for-each-object-iteration Jeff King
2018-08-14 18:14 ` [PATCH 1/4] cat-file: use oidset check-and-insert Jeff King
2018-08-14 18:18 ` [PATCH 2/4] cat-file: split batch "buf" into two variables Jeff King
2018-08-14 18:20 ` [PATCH 3/4] cat-file: use a single strbuf for all output Jeff King
2018-08-14 19:30 ` René Scharfe
2018-08-14 19:39 ` Jeff King
2018-08-14 18:21 ` [PATCH 4/4] for_each_*_object: move declarations to object-store.h Jeff King
2018-08-15 14:05 ` [PATCH 0/7] speeding up cat-file by reordering object access Derrick Stolee
2018-08-16 17:39 ` Jeff King
2018-08-16 18:52 ` Junio C Hamano
2018-08-16 19:45 ` Jeff King [this message]