git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Jakub Narębski" <jnareb@gmail.com>
To: Jeff King <peff@peff.net>
Cc: Junio C Hamano <gitster@pobox.com>,
	Stefan Beller <sbeller@google.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Marc Strapetz <marc.strapetz@syntevo.com>,
	Git Mailing List <git@vger.kernel.org>
Subject: Re: topological index field for commit objects
Date: Thu, 30 Jun 2016 12:30:31 +0200	[thread overview]
Message-ID: <5774F4C7.805@gmail.com> (raw)
In-Reply-To: <20160629220049.GA4416@sigill.intra.peff.net>

W dniu 2016-06-30 o 00:00, Jeff King pisze:
> On Wed, Jun 29, 2016 at 11:49:35PM +0200, Jakub Narębski wrote:
> 
>>> So this is the ideal case for generation numbers (the worst cases are
>>> when the things you are looking for are in branchy, close history where
>>> the generation numbers don't tell you much; but in such cases the
>>> walking is usually not too bad).
>>
>> There are other approaches (special indices) that help reachability
>> queries beside "generation number".
> 
> Yes, though generation numbers can help with more questions (e.g., "what
> is the merge base").

Well, those special indices usually need generation number anyway. For
example FELINE indices^1 (from "Reachability Queries in Very Large Graphs",
http://openproceedings.org/EDBT/2014/paper_166.pdf, that I found when
trying to find more about compressed bitmap indices used by Git) also
utilize generation numbers to speed up reachability analysis. The idea
behind FELINE is to associate every vertex (commit) with a unique ordered
pair of natural integers (i.e. two more numbers, in addition to the level
aka generation number), so that partial order over N^2 is superset of
partial order of graph (DAG of revisions). It can answer in O(1) that
nodes are not connected, and cut search space if they are.

BTW. some references in this paper ("Related work" section) use PWAH
compression scheme - perhaps it would work with EWAH that Git uses?

1. FELINE = Fast rEfined onLINE search ... or fun with acronyms

>> By the way, what should happen if you add a replacement (in the git-replace
>> meaning) that creates a shortcut, therefore invalidating generation numbers,
>> at least in strict sense - committerdate as generation number would be still
>> good, I think?
> 
> This is one of the open questions. My older patches turned them off when
> replacements and grafts are in effect.

Well, if you store the cache of generation numbers in the packfile, or in
the index of the packfile, or in the bitmap file, or in separate bitmap-like
file, generating them on repack, then of course any grafts or replacements
invalidate them... though for low level commands (like object counting)
replacements are transparent -- or rather they are (and can be) treated as
any other ref for reachability analysis.

Well, if there are no grafts, you could still use them for doing
"git --no-replace-objects log ...", isn't it?

SIDENOTE: should grafts be deprecated and later removed, now that Git has
superior alternative in the form of git-replace, and a simple way to convert
grafts to replacements?

Anyway, if file with mapping from revisions to its generation numbers
were stored outside of packfiles, it could simply be updated / rewritten
when adding new replacement. You could use one cache for no-replace,
and one for replace. Though I do wonder if it would be possible to do
such rewrite fast... well, fast enough assuming that adding replacements
is rare.

Answering my own question:
>> committerdate as generation number would be still good, I think?
No, it wouldn't:

  pre replace:

    1 <-- 2 <-- 3 <-- 5 <--
           \
            \----- 4 <-- 6 <-- 7 <-- 8

  post replace

   1 <-- 2                              \-- 3' <-- 5
          \                              \
           \------ 4 <-- 6 <-- 7 <-- 8 <--\

Either the replacement commit would have generation number correct, but
its child wouldn't, or its child would have correct generation number but
the replacement wouldn't.
 
>>> I have patches that generate and store the numbers at pack time, similar
>>> to the way we do the reachability bitmaps.

Ah, so those cached generation numbers are generated and stored at pack
time. Where you store them: is it a separate file? Bitmap file? Packfile?

>>>                                            They're not production ready,
>>> but they could probably be made so without too much effort. You wouldn't
>>> have ready-made generation numbers for commits since the last full
>>> repack, but you can compute them incrementally based on what you do have
>>> at a cost linear to the unpacked commits (this is the same for bitmaps).
>>
>> Do Git use EWAH / EWOK bitmaps for reachability analysis, or is it still
>> limited to object counting?
> 
> At GitHub we are using them for --contains analysis, along with mass
> ahead/behind (e.g., as in https://github.com/gitster/git/branches). My
> plan is to send patches upstream, but they need some cleanup first.

That would be nice to have, please.

Er, is mass ahead/behind something that can be plugged into Git
(e.g. for "git branch -v -v"), or is it something GitHub-specific?


P.S. Having Git ensure that committerdate (as an epoch) is greater
than committerdates of its parents at the commit creation time (with
providing warning about time skew, perhaps not doing it if skew is
too large) would be not costly. No need to add anything, and it would
help future queries to be faster, I think.

-- 
Jakub Narębski


  parent reply	other threads:[~2016-06-30 10:31 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-06-29 18:31 topological index field for commit objects Marc Strapetz
2016-06-29 18:59 ` Junio C Hamano
2016-06-29 20:20   ` Stefan Beller
2016-06-29 20:39     ` Junio C Hamano
2016-06-29 20:54       ` Stefan Beller
2016-06-29 21:37         ` Stefan Beller
2016-06-29 21:43           ` Jeff King
2016-06-29 20:56       ` Jeff King
2016-06-29 21:49         ` Jakub Narębski
2016-06-29 22:00           ` Jeff King
2016-06-29 22:11             ` Junio C Hamano
2016-06-29 22:30               ` Jeff King
2016-07-05 11:43                 ` Johannes Schindelin
2016-07-05 12:59                   ` Jakub Narębski
2016-06-30 10:30             ` Jakub Narębski [this message]
2016-06-30 18:12               ` Linus Torvalds
2016-06-30 23:39                 ` Jakub Narębski
2016-06-30 23:59                 ` Mike Hommey
2016-07-01  3:17                 ` Jeff King
2016-07-01  6:45                   ` Marc Strapetz
2016-07-01  9:48                   ` Jakub Narębski
2016-07-01 16:08                   ` Junio C Hamano
2016-07-01  6:54               ` Jeff King
2016-07-01  9:59                 ` Jakub Narębski
2016-07-20  0:07             ` Jakub Narębski
2016-07-20 13:02               ` Jeff King
2017-02-04 13:43                 ` Jakub Narębski
2017-02-17  9:26                   ` Jeff King
2017-02-17  9:28                     ` Jakub Narębski
2016-06-29 22:15       ` Marc Strapetz
2016-06-29 21:00   ` Jakub Narębski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5774F4C7.805@gmail.com \
    --to=jnareb@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=marc.strapetz@syntevo.com \
    --cc=peff@peff.net \
    --cc=sbeller@google.com \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).